Abstract

We study Markov decision problems where the agent does not know the 
transition probability function mapping current states and actions to future 
states. The agent has a prior belief over a set of possible transition functions 
and updates beliefs using Bayes’ rule. We allow her to be misspecified in the 
sense that the true transition probability function is not in the support of her 
prior. This problem is relevant in many economic settings but is usually not 
amenable to analysis by the researcher. We make the problem tractable by 
studying asymptotic behavior. We propose an equilibrium notion and provide 
conditions under which it characterizes steady state behavior. In the special 
case where the problem is static, equilibrium coincides with the single-agent 
version of Berk-Nash equilibrium (Esponda and Pouzo, 2016). We also discuss 
subtle issues that arise exclusively in dynamic settings due to the possibility of 
a negative value of experimentation. 


*We thank Vladimir Asriyan, Hector Chade, Xiaohong Chen, Emilio Espino, Drew Fudenberg,
Bruce Hansen, Philippe Jehiel, Jack Porter, Philippe Rigollet, Tom Sargent, Ivan Werning, and several
seminar participants for helpful comments. Esponda: Olin Business School, Washington University
in St. Louis, 1 Brookings Drive, Campus Box 1133, St. Louis, MO 63130, iesponda@wustl.edu;
Pouzo: Department of Economics, UC Berkeley, 530-1 Evans Hall #3880, Berkeley, CA 94720,
dpouzo@econ.berkeley.edu.



Contents 


1 Introduction

2 Markov Decision Processes

3 Subjective Markov Decision Processes

3.1 Setup

3.2 Equilibrium

3.3 Correctly specified and identified SMDPs

4 Examples

4.1 Monopolist with unknown dynamic demand

4.2 Search with uncertainty about future job offers

4.3 Stochastic growth with correlated shocks

5 Equilibrium foundation

6 Equilibrium refinements

7 Conclusion

References

Appendix

Online Appendix








1 Introduction 


Early interest in studying the behavior of agents who hold misspecified views of
the world (e.g., Arrow and Green (1973), Kirman (1975), Sobel (1984), Kagel and
Levin (1986), Nyarko (1991), Sargent (1999)) has recently been renewed by the work
of Piccione and Rubinstein (2003), Jehiel (2005), Eyster and Rabin (2005), Jehiel
and Koessler (2008), Esponda (2008), Esponda and Pouzo (2012, 2016), Eyster and
Piccione (2013), Spiegler (2013, 2016a, 2016b), Heidhues et al. (2016), and Fudenberg
et al. (2016). There are at least two reasons for this interest. First, it is natural for agents
to be uncertain about their complex environment and to represent this uncertainty
with parsimonious parametric models that are likely to be misspecified. Second,
endowing agents with misspecified models can explain how certain biases in behavior
arise endogenously as a function of the primitives. 1

The previous literature mostly focuses on problems that are intrinsically “static” 
in the sense that they can be viewed as repetitions of static problems where the only 
link between periods arises because the agent is learning the parameters of the model. 
Yet dynamic decision problems, where an agent chooses an action that affects a state 
variable (other than a belief), are ubiquitous in economics. The main goal of this 
paper is to provide a tractable framework to study dynamic settings where the agent 
learns with a possibly misspecified model.

We study a Markov Decision Process where a single agent chooses actions at
discrete time intervals. A transition probability function describes how the agent’s
action and the current state affect next period’s state. The current payoff is a
function of states and actions. We assume that the agent is uncertain about the true
transition probability function and wants to maximize expected discounted payoff.
She has a prior belief over a set of possible transition functions, and her model is
possibly misspecified, meaning that we do not require the true transition probability
function to be in the support of her prior. The agent uses Bayes’ rule to update her
belief after observing the realized state.

To better illustrate the main question and results, consider a dynamic savings
problem with unknown returns, where $s$ is current income, $x$ is the choice of savings,
$\pi(s - x)$ is the payoff from current consumption, and next period’s income $s'$ is drawn

1 We take the misspecified model as a primitive and assume that agents learn and behave optimally
given their model. In contrast, Hansen and Sargent (2008) study optimal behavior of agents who
have a preference for robustness because they are aware of the possibility of model misspecification.





from the distribution $Q(\cdot \mid s, x)$. The agent, however, does not know the return
distribution $Q$. She has a parametric model representing the set of possible return
distributions $Q_\theta$ indexed by a parameter $\theta \in \Theta$. The agent has a prior $\mu$ over $\Theta$, and
this belief is updated using Bayes’ rule based on current income, the savings decision,
and the income realized next period, $\mu' = B(s, x, s', \mu)$, where $B$ denotes the Bayesian
operator and $\mu'$ is the posterior belief. The agent is correctly specified if the support
of her prior includes the true return distribution $Q$ and is misspecified otherwise. We
represent this problem recursively via the following Bellman equation:


$$W(s, \mu) = \max_{x \in [0, s]} \; \pi(s - x) + \delta \int_\Theta \int_{\mathbb{S}} W(s', \mu') \, Q_\theta(ds' \mid s, x) \, \mu(d\theta). \qquad (1)$$


The solution to this Bellman equation determines the evolution of states, actions, 
and beliefs. A large computational literature provides algorithms that agents and 
researchers can use to approximate the solution to problems such as (1), where a 
belief is part of the state variable; see Powell (2007) for a textbook treatment. 2 The 
issue for economists, however, is that these numerical methods do not usually allow 
us to make general predictions about behavior. 
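To make the belief-updating step behind (1) concrete, the following minimal sketch applies Bayes' rule on a finite grid of parameter values. The finite grid, the function name `bayes_update`, and the two-point example are assumptions made for illustration only; the model itself allows a general parameter set.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """Return the posterior over a finite parameter grid.

    prior[k]       : prior weight mu(theta_k)
    likelihoods[k] : Q_{theta_k}(s' | s, x) at the observed triple (s, x, s')
    """
    prior = np.asarray(prior, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    unnormalized = prior * likelihoods
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("observation has zero probability under the prior")
    return unnormalized / total

# Hypothetical two-point parameter set: theta is the probability of a high return.
thetas = np.array([0.3, 0.7])
mu = np.array([0.5, 0.5])
# A high return is observed, so the likelihood under each theta equals theta.
mu_next = bayes_update(mu, thetas)
```

In problem (1) this posterior becomes part of next period's state, which is what makes the state space large and motivates the steady-state approach.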

We propose to circumvent this problem by instead characterizing the agent’s 
steady state behavior and beliefs. The main question that we ask is whether we can 
replace a dynamic programming problem with learning, such as (1), by a problem 
where beliefs are not being updated, such as 


$$V(s) = \max_{x \in [0, s]} \; \pi(s - x) + \delta \int_{\mathbb{S}} V(s') \, \bar{Q}_{\mu^*}(ds' \mid s, x), \qquad (2)$$


where $\mu^*$ is the agent’s equilibrium or steady-state belief over $\Theta$ and $\bar{Q}_{\mu^*} = \int_\Theta Q_\theta \, \mu^*(d\theta)$
is the corresponding subjective transition probability function. We refer to this problem
as a Markov Decision Process (MDP) with transition probability function $\bar{Q}_{\mu^*}$.
The main advantage of this approach is that, provided that we can characterize the
equilibrium belief $\mu^*$, it obviates the need to include beliefs in the state space, thus
making the problem much more amenable to analysis. This focus on equilibrium
behavior is indeed a distinguishing feature of economics.

We begin by defining a notion of equilibrium to capture the steady state behavior 


2 Of course, we do not expect less sophisticated agents to apply these numerical methods. But, 
following the standard view in the literature, the dynamic programming approach is still a useful 
tool for the researcher to model the behavior of an agent facing intertemporal tradeoffs. 
and belief of an agent who does not know the true transition probability function.
We call this notion a Berk-Nash equilibrium because, in the special case where the
environment is static, it collapses to the single-agent version of Berk-Nash equilibrium,
a concept introduced by Esponda and Pouzo (2016) to characterize steady state
behavior in static environments with misspecified agents. A strategy in an MDP is
a mapping from states to actions; recall that beliefs are not included in the state for
an MDP. For a given strategy and true transition probability function, the stochastic
process for states and actions in an MDP is a Markov chain and has a corresponding
stationary distribution that can be interpreted as the steady-state distribution
over outcomes. A strategy and corresponding stationary distribution is a Berk-Nash
equilibrium if there exists a belief $\mu^*$ over the parameter space such that: (i) the
strategy is optimal for an MDP with transition probability function $\bar{Q}_{\mu^*}$, and (ii) $\mu^*$
puts probability one on the set of parameter values that yield transition probability
functions that are “closest” to the true transition probability function. The notion
of “closest” is given by a weighted version of the Kullback-Leibler divergence that
depends on the equilibrium stationary distribution.

We use the framework to revisit three classic examples. These examples illustrate
how our framework makes dynamic environments with uncertainty amenable to analysis
and expands the scope of the classical dynamic programming approach. First,
we consider the classic problem of a monopolist with unknown demand function. We
assume that demand is dynamic, so that a sale in the current period affects the likelihood
of a sale the next period. The monopolist, however, has a misspecified model
and believes that demand is not dynamic. We show that a monopolist who thinks
demand is not dynamic does not necessarily set higher prices.

The second illustrative example is a search model where a worker does not realize 
that she gets fired with higher probability in times in which it is actually harder to 
find another job. We show that she becomes pessimistic about the chances of finding 
a new job and sub-optimally accepts wage offers that are too low. 

The final example is a stochastic growth model along the lines of the problem
represented by (1). The agent determines how much of her income to invest every
period, which determines, together with an unknown productivity process, next
period’s income. We assume that there are correlated shocks to both the agent’s
utility and productivity, but the agent believes these shocks to be independent. If
the shocks are positively correlated, the misspecified agent invests more of her income
when productivity is low. She ends up underestimating productivity and, therefore,
underinvesting in equilibrium.

We then turn to providing a foundation for Berk-Nash equilibrium by studying the 
limiting behavior of a Bayesian agent who takes actions and updates her beliefs about 
the transition probability function every period. We ask if an equilibrium approach 
is appropriate in this environment, i.e., “Is it possible to characterize the steady state 
behavior of a Bayesian agent by reference to a simpler MDP in which the agent has 
fixed (though possibly incorrect) beliefs about the transition probability function?” 

The answer is yes if the agent is sufficiently impatient. But, if the agent is sufficiently
patient, some subtle issues arise in the dynamic setting that lead to a more
nuanced answer: The answer is yes provided that we restrict attention to steady
states with a property we call exhaustive learning. Under exhaustive learning, the 
agent perceives that she has nothing else to learn in steady state. In the context of the 
previous example, this condition guarantees that optimal actions in problem (1) are 
also optimal in problem (2). Without exhaustive learning, an action may be optimal 
in problem (2) because the agent is not updating her beliefs. But the same action 
could be suboptimal if she were to update beliefs because, as we show in this paper, 
the value of experimentation can be negative in dynamic settings. This situation is 
not possible in static settings because the value function is only a function of beliefs 
and its convexity and the martingale property of Bayesian beliefs imply that the value 
of experimentation is always nonnegative. 

The notion of exhaustive learning motivates a natural refinement of Berk-Nash 
equilibrium in dynamic settings. This refinement, however, still allows beliefs to be 
incorrect due to lack of experimentation, which is a hallmark of the bandit (e.g., 
Rothschild (1974b), McLennan (1984), Easley and Kiefer (1988)) and self-confirming 
equilibrium (e.g., Battigalli (1987), Fudenberg and Levine (1993), Dekel et al. (2004), 
Fershtman and Pakes (2012)) literatures. Following Selten (1975), we define a further 
refinement, perfect Berk-Nash equilibrium, to characterize behavior that is robust to 
experimentation, and provide conditions for its existence. 

Our asymptotic characterization of beliefs and actions contributes to the literature 
that studies asymptotic beliefs and/or behavior under Bayesian learning. Table 1 
categorizes some of the more relevant papers in connection to our work. The table 
on the left includes papers where the agent learns from data that is exogenous in 
the sense that she does not affect the stochastic properties of the data. This topic 
| Exogenous Data | Correctly Specified | Misspecified | Endogenous Data | Correctly Specified | Misspecified |
|---|---|---|---|---|---|
| i.i.d. | Schwartz [65]; Freedman [63]; Diaconis-Freedman [86] | Berk [65]; Bunke-Milhaud [98] | Static | Rothschild [74]^; Gittins [79]^; McLennan [84]^; Easley-Kiefer [88]; Aghion et al [91] | Nyarko [91]^; Esponda [08]^; Esponda-Pouzo [16]; Heidhues et al [16]^ |
| non-i.i.d. | Ghosal-Van der Vaart [07] | Shalizi [09]; Vayanos-Rabin [10]^ | Dynamic | Freixas [81]^; Koulovatianos et al [09]^; This paper | Fudenberg et al [16]^; This paper |

Table 1: Literature on Bayesian Learning


has mostly been tackled by statisticians for both correctly-specified and misspecified 
models and for both i.i.d. and non-i.i.d. data. The table on the right includes papers 
where the agent learns from data that is endogenous in the sense that it is driven 
by the agent’s actions, a topic that has been studied by economists mostly in static 
settings. By static we mean that the problem reduces to a static optimization problem 
if stripped of the learning dynamics. 3 

Table 1 also differentiates between two complementary approaches to studying 
asymptotic beliefs and/or behavior. The first approach is to focus on specific settings 
and provide a complete characterization of asymptotic actions and beliefs, including 
convergence results; these papers are marked with a superscript ^ in Table 1. Some
papers pursue this approach in dynamic and correctly specified stochastic growth
models (e.g., Freixas (1981), Koulovatianos et al. (2009)). In static misspecified settings,
Nyarko (1991), Esponda (2008), and Heidhues et al. (2016) study passive learning
problems where there is no experimentation motive. Fudenberg et al. (2016) is
the only paper that provides a complete characterization in a dynamic decision problem
with active learning. 4, 5 The second approach, which we follow in this paper and

3 Formally, we say a problem is static if, for a fixed strategy and belief over the transition probability
function, outcomes (states and actions) are independent across time.

4 Under active learning, different actions convey different amounts of information and a non-myopic
agent takes the exploitation vs. experimentation tradeoff into account. There can be passive or active
learning in both static and dynamic settings.

5 The environment in Fudenberg et al. (2016) is dynamic because the agent controls the drift of 
a Brownian motion, even though the only relevant state variable for optimality ends up being the 
agent’s belief. 
we followed earlier for the static case (Esponda and Pouzo, 2016), is to study general
settings and focus on characterizing the set of steady states. 6

The paper is also related to the literature which provides learning foundations for 
equilibrium concepts, such as Nash or self-confirming equilibrium (see Fudenberg and 
Levine (1998) for a survey). In contrast to this literature, we consider Markov decision 
problems and allow for misspecified models. Particular types of misspecifications have
been studied in extensive form games. Jehiel (1995) considers the class of repeated
alternating-move games and assumes that players only forecast a limited number of 
time periods into the future; see Jehiel (1998) for a learning foundation. We share 
the feature that the learning process takes place within the play of the game and that 
beliefs are those that provide the best fit given the data. 7 

The framework and equilibrium notion are presented in Sections 2 and 3. In 
Section 4, we work through several examples. We provide a foundation for equilibrium 
in Section 5 and study equilibrium refinements in Section 6. 

2 Markov Decision Processes 

We begin by describing the environment faced by the agent. 

Definition 1. A Markov Decision Process (MDP) is a tuple $(\mathbb{S}, \mathbb{X}, \Gamma, q_0, Q, \pi, \delta)$
where

• $\mathbb{S}$ is a nonempty and finite set of states

• $\mathbb{X}$ is a nonempty and finite set of actions

• $\Gamma : \mathbb{S} \to 2^{\mathbb{X}}$ is a non-empty constraint correspondence

• $q_0 \in \Delta(\mathbb{S})$ is a probability distribution on the initial state

• $Q : \mathrm{Gr}(\Gamma) \to \Delta(\mathbb{S})$ is a transition probability function 8

• $\pi : \mathrm{Gr}(\Gamma) \times \mathbb{S} \to \mathbb{R}$ is a per-period payoff function

6 In macroeconomics there are several models where agents make forecasts using statistical models 
that are misspecified (e.g., Evans and Honkapohja (2001) Ch. 13, Sargent (1999) Ch. 6). 

7 Jehiel and Samet (2007) consider the general class of extensive form games with perfect information
and assume that players simplify the game by partitioning the nodes into similarity classes.

8 For a correspondence $\Gamma : \mathbb{S} \to 2^{\mathbb{X}}$, its graph is defined by $\mathrm{Gr}(\Gamma) = \{(s, x) \in \mathbb{S} \times \mathbb{X} : x \in \Gamma(s)\}$.

• $\delta \in [0, 1)$ is a discount factor


We sometimes use MDP(Q) to denote an MDP with transition probability function 
Q and exclude the remaining primitives. 

The timing is as follows. At the beginning of each period $t = 0, 1, 2, \ldots$, the agent
observes state $s_t \in \mathbb{S}$ and chooses a feasible action $x_t \in \Gamma(s_t) \subseteq \mathbb{X}$. Then a new state
$s_{t+1}$ is drawn according to the probability distribution $Q(\cdot \mid s_t, x_t)$ and the agent
receives payoff $\pi(s_t, x_t, s_{t+1})$ in period $t$. The initial state $s_0$ is drawn according to
the probability distribution $q_0$.

The agent facing an MDP chooses a policy rule that specifies at each point in time 
a (possibly random) action as a function of the history of states and actions observed 
up to that point. As usual, the objective of the agent is to choose a feasible policy
rule to maximize expected discounted utility, $\sum_{t=0}^{\infty} \delta^t \pi(s_t, x_t, s_{t+1})$.

By the Principle of Optimality, the agent’s problem can be cast recursively as

$$V_Q(s) = \max_{x \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s, x, s') + \delta V_Q(s')\} \, Q(ds' \mid s, x), \qquad (3)$$

where $V_Q : \mathbb{S} \to \mathbb{R}$ is the (unique) solution to the Bellman equation (3).
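Because states and actions are finite, the Bellman equation (3) can be solved by successive approximation. The sketch below is one standard way to do this (value iteration); it is not taken from the paper, and the array layout, the function name `value_iteration`, and the numerical example are assumptions made for illustration.

```python
import numpy as np

def value_iteration(Q, payoff, feasible, delta, tol=1e-10, max_iter=100_000):
    """Approximate the solution V_Q of the Bellman equation (3).

    Q[s, x, s2]      : transition probability Q(s2 | s, x)
    payoff[s, x, s2] : per-period payoff pi(s, x, s2)
    feasible[s, x]   : True when x is in Gamma(s); each state needs one feasible action
    Returns the value function and the last table of action values.
    """
    V = np.zeros(Q.shape[0])
    for _ in range(max_iter):
        # Expected payoff plus discounted continuation value for every pair (s, x).
        action_values = np.einsum("sxt,sxt->sx", Q, payoff + delta * V[None, None, :])
        action_values = np.where(feasible, action_values, -np.inf)
        V_new = action_values.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, action_values
        V = V_new
    raise RuntimeError("value iteration did not converge")

# Hypothetical two-state, two-action example (numbers are illustrative only).
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
payoff = np.zeros((2, 2, 2))
payoff[:, :, 1] = 1.0  # payoff of 1 whenever next period's state is 1
feasible = np.ones((2, 2), dtype=bool)
V, action_values = value_iteration(Q, payoff, feasible, delta=0.9)
optimal_actions = action_values.argmax(axis=1)
```

The actions that attain the maximum in each row of `action_values` are the ones an optimal strategy may play with positive probability.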

Definition 2. A strategy $\sigma$ is a distribution over actions given states, $\sigma : \mathbb{S} \to \Delta(\mathbb{X})$,
that satisfies $\sigma(\cdot \mid s) \in \Delta(\Gamma(s))$ for all $s$.

Let $\Sigma$ denote the space of all strategies and let $\sigma(x \mid s)$ denote the probability
that the agent chooses $x$ when the state is $s$. 9

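Definition 3. A strategy $\sigma \in \Sigma$ is optimal for an MDP($Q$) if, for all $s \in \mathbb{S}$ and all
$x \in \mathbb{X}$ such that $\sigma(x \mid s) > 0$,

$$x \in \arg\max_{\hat{x} \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s, \hat{x}, s') + \delta V_Q(s')\} \, Q(ds' \mid s, \hat{x}).$$

Let $\Sigma(Q)$ be the set of all strategies that are optimal for an MDP($Q$).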

9 A standard result is the existence of a deterministic optimal strategy. Nevertheless, allowing for 
randomization will be important in the case where the transition probability function is uncertain. 
Lemma 1. (i) There is a unique solution $V_Q$ to the Bellman equation in (3), and
it is continuous in $Q$ for all $s \in \mathbb{S}$; (ii) The correspondence of optimal strategies
$Q \mapsto \Sigma(Q)$ is non-empty, compact-valued, convex-valued, and upper hemicontinuous.

Proof. The proof is standard and relegated to the Online Appendix. □ 

A strategy determines the transitions in the space of states and actions and,
consequently, the set of stationary distributions over states and actions. For any
strategy $\sigma$ and transition probability function $Q$, define a transition kernel
$M_{\sigma,Q} : \mathrm{Gr}(\Gamma) \to \Delta(\mathrm{Gr}(\Gamma))$ by letting

$$M_{\sigma,Q}(s', x' \mid s, x) = \sigma(x' \mid s') \, Q(s' \mid s, x) \qquad (4)$$

for all $(s, x), (s', x') \in \mathrm{Gr}(\Gamma)$. The transition kernel is the transition probability
function over $\mathrm{Gr}(\Gamma)$ given strategy $\sigma$ and transition probability function $Q$.

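For any $m \in \Delta(\mathrm{Gr}(\Gamma))$, let $M_{\sigma,Q}[m] \in \Delta(\mathrm{Gr}(\Gamma))$ denote the probability measure

$$\sum_{(s,x) \in \mathrm{Gr}(\Gamma)} M_{\sigma,Q}(\cdot \mid s, x) \, m(s, x).$$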

Definition 4. A distribution $m \in \Delta(\mathrm{Gr}(\Gamma))$ is a stationary (or invariant) distribution
given $(\sigma, Q)$ if $m = M_{\sigma,Q}[m]$.

A stationary distribution represents the steady-state distribution over outcomes
(i.e., states and actions) when the agent follows a given strategy. Let $I_Q(\sigma) = \{m \in
\Delta(\mathrm{Gr}(\Gamma)) \mid m = M_{\sigma,Q}[m]\}$ denote the set of stationary distributions given $(\sigma, Q)$.
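The kernel in (4) and the fixed-point condition in Definition 4 translate directly into a small linear-algebra computation. The sketch below assumes, for simplicity, that every action is feasible in every state (so that $\mathrm{Gr}(\Gamma) = \mathbb{S} \times \mathbb{X}$); the function names and the numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def transition_kernel(sigma, Q):
    """Build M[(s, x), (s2, x2)] = sigma(x2 | s2) * Q(s2 | s, x) as a square matrix.

    sigma[s, x]  : probability sigma(x | s)
    Q[s, x, s2]  : transition probability Q(s2 | s, x)
    """
    n_s, n_x = sigma.shape
    M = np.einsum("sxt,ty->sxty", Q, sigma)  # axes: (s, x, s2, x2)
    return M.reshape(n_s * n_x, n_s * n_x)

def stationary_distribution(sigma, Q):
    """Solve m = M[m] together with sum(m) = 1 by least squares.

    Returns one element of I_Q(sigma); it is the unique element only when the
    induced Markov chain has a single recurrent class.
    """
    M = transition_kernel(sigma, Q)
    n = M.shape[0]
    A = np.vstack([M.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return m.reshape(sigma.shape)

# Hypothetical example: two states, two actions, strategy always plays action 0.
Q = np.zeros((2, 2, 2))
Q[:, :, 1] = [[0.50, 0.20], [0.95, 0.60]]  # probability of moving to state 1
Q[:, :, 0] = 1.0 - Q[:, :, 1]
sigma = np.array([[1.0, 0.0], [1.0, 0.0]])
m = stationary_distribution(sigma, Q)
```

In this example the chain moves from state 0 to state 1 with probability 0.5 and back with probability 0.05, so the stationary weight on state 1 is $0.5 / 0.55 = 10/11$, all of it on action 0.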

Lemma 2. The correspondence of stationary distributions $\sigma \mapsto I_Q(\sigma)$ is non-empty,
compact-valued, convex-valued, and upper hemicontinuous.

Proof. See the Appendix. □ 

3 Subjective Markov Decision Processes 

Our main objective is to study the behavior of an agent who faces an MDP but is
uncertain about the transition probability function. We begin by introducing a new
object to model the problem with uncertainty, which we call the Subjective Markov
Decision Process (SMDP). We then define the notion of a Berk-Nash equilibrium of
an SMDP.

3.1 Setup 

Definition 5. A Subjective Markov Decision Process (SMDP) is an MDP,
$(\mathbb{S}, \mathbb{X}, \Gamma, q_0, Q, \pi, \delta)$, and a nonempty family of transition probability functions,
$\mathcal{Q}_\Theta = \{Q_\theta : \theta \in \Theta\}$, where each transition probability function $Q_\theta : \mathrm{Gr}(\Gamma) \to \Delta(\mathbb{S})$ is
indexed by a parameter $\theta \in \Theta$.

We interpret the set $\mathcal{Q}_\Theta$ as the different transition probability functions (or models
of the world) that the agent considers possible. We sometimes use SMDP($Q$, $\mathcal{Q}_\Theta$) to
denote an SMDP with true transition probability function $Q$ and a family of transition
probability functions $\mathcal{Q}_\Theta$.

Definition 6. A Regular Subjective Markov Decision Process (regular-SMDP)
is an SMDP that satisfies the following conditions

• $\Theta$ is a compact subset of a Euclidean space.

• $Q_\theta(s' \mid s, x)$ is continuous as a function of $\theta \in \Theta$ for all $(s', s, x) \in \mathbb{S} \times \mathrm{Gr}(\Gamma)$.

• There is a dense set $\hat{\Theta} \subseteq \Theta$ such that, for all $\theta \in \hat{\Theta}$, $Q_\theta(s' \mid s, x) > 0$ for all
$(s', s, x) \in \mathbb{S} \times \mathrm{Gr}(\Gamma)$ such that $Q(s' \mid s, x) > 0$.

The first two conditions in Definition 6 place parametric and continuity assumptions
on the subjective models. 10 The last condition plays two roles. First, it rules
out a stark form of misspecification by guaranteeing that there exists at least one 
parameter value that can rationalize every feasible observation. Second, it implies 
that the correspondence of parameters that are a closest fit to the true model is 
upper hemicontinuous. Esponda and Pouzo (2016) provide a simple (non-dynamic) 
example where this assumption does not hold and equilibrium fails to exist. 

10 Without the assumption of a finite-dimensional parameter space, Bayesian updating need not 
converge to the truth for most priors and parameter values even in correctly specified statistical 
settings (Freedman (1963), Diaconis and Freedman (1986)). Note that the parametric assumption 
is only a restriction if the set of states or actions is nonfinite, a case we consider in some of the 
examples. 
3.2 Equilibrium


The goal of this section is to define the notion of Berk-Nash equilibrium of an SMDP.
The next definition is used to place constraints on the belief $\mu \in \Delta(\Theta)$ that the agent
may hold if $m$ is the stationary distribution over outcomes.

Definition 7. The weighted Kullback-Leibler divergence (wKLD) is a mapping
$K_Q : \Delta(\mathrm{Gr}(\Gamma)) \times \Theta \to \mathbb{R}_+$ such that for any $m \in \Delta(\mathrm{Gr}(\Gamma))$ and $\theta \in \Theta$,

$$K_Q(m, \theta) = \sum_{(s,x) \in \mathrm{Gr}(\Gamma)} E_{Q(\cdot \mid s, x)} \left[ \ln \frac{Q(S' \mid s, x)}{Q_\theta(S' \mid s, x)} \right] m(s, x).$$

The set of closest parameter values given $m \in \Delta(\mathrm{Gr}(\Gamma))$ is the set

$$\Theta_Q(m) = \arg\min_{\theta \in \Theta} K_Q(m, \theta).$$

The set $\Theta_Q(m)$ contains the parameter values that constitute the best fit with the true
transition probability function $Q$ when outcomes are drawn from the distribution $m$.
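The following sketch evaluates the weighted Kullback-Leibler divergence of Definition 7 on finite arrays and selects the minimizers from a finite family of candidate models. The finite family, the function names, and the tolerance are assumptions made for illustration. The example encodes a fair coin that the agent believes lands heads with probability 1/4 or 3/4; both candidates fit equally well.

```python
import numpy as np

def wkld(m, Q, Q_theta):
    """Weighted KL divergence K_Q(m, theta) for arrays indexed [s, x, s2].

    Uses the convention 0 * ln(0 / q) = 0 and returns inf when Q_theta assigns
    zero probability to an outcome that can occur at a pair with m(s, x) > 0.
    """
    total = 0.0
    for (s, x), weight in np.ndenumerate(m):
        if weight == 0.0:
            continue
        for s2 in range(Q.shape[2]):
            p, q = Q[s, x, s2], Q_theta[s, x, s2]
            if p == 0.0:
                continue
            if q == 0.0:
                return np.inf
            total += weight * p * np.log(p / q)
    return total

def closest_parameters(m, Q, family, atol=1e-9):
    """Return the labels in `family` (label -> Q_theta array) that minimize the wKLD."""
    values = {theta: wkld(m, Q, Q_theta) for theta, Q_theta in family.items()}
    best = min(values.values())
    return [theta for theta, value in values.items() if value <= best + atol]

# Fair coin: two states (heads = 0, tails = 1), one action, i.i.d. draws.
Q = np.full((2, 1, 2), 0.5)
family = {theta: np.tile([theta, 1.0 - theta], (2, 1, 1)) for theta in (0.25, 0.75)}
m = np.array([[0.5], [0.5]])
best = closest_parameters(m, Q, family)
```

With a continuous parameter set one would replace the finite family by a numerical minimization over $\Theta$; the grid version is used here only to keep the sketch short.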


Lemma 3. (i) For every $m \in \Delta(\mathrm{Gr}(\Gamma))$ and $\theta \in \Theta$, $K_Q(m, \theta) \geq 0$, with equality
holding if and only if $Q_\theta(\cdot \mid s, x) = Q(\cdot \mid s, x)$ for all $(s, x)$ such that $m(s, x) > 0$.
(ii) For any regular SMDP($Q$, $\mathcal{Q}_\Theta$), $m \mapsto \Theta_Q(m)$ is non-empty, compact valued, and
upper hemicontinuous.

Proof. See the Appendix. □ 

We now define equilibrium. 

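Definition 8. A strategy and probability distribution $(\sigma, m) \in \Sigma \times \Delta(\mathrm{Gr}(\Gamma))$ is a
Berk-Nash equilibrium of the SMDP($Q$, $\mathcal{Q}_\Theta$) if there exists a belief $\mu \in \Delta(\Theta)$ such
that

(i) $\sigma$ is an optimal strategy for the MDP($\bar{Q}_\mu$), where $\bar{Q}_\mu = \int_\Theta Q_\theta \, \mu(d\theta)$,

(ii) $\mu \in \Delta(\Theta_Q(m))$, and

(iii) $m \in I_Q(\sigma)$.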
Condition (i) in the definition of Berk-Nash equilibrium requires $\sigma$ to be an optimal
strategy in the MDP where the transition probability function is $\int_\Theta Q_\theta \, \mu(d\theta)$.
Condition (ii) requires that the agent only puts positive probability on the set of
closest parameter values given $m$, $\Theta_Q(m)$. Finally, condition (iii) requires $m$ to be a
stationary distribution given $(\sigma, Q)$.

Remark 1. In Section 5, we interpret the set of equilibria as the set of steady states of 
a learning environment where the agent is uncertain about Q. The main advantage 
of the equilibrium approach is that it allows us to replace a difficult learning problem 
with a simpler MDP with a fixed transition probability function. The cost of this 
approach is that it can only be used to characterize asymptotic behavior, as opposed 
to the actual dynamics starting from the initial distribution over states, $q_0 \in \Delta(\mathbb{S})$.
This explains why $q_0$ does not enter the definition of equilibrium, and why a mapping
between $q_0$ and the set of corresponding equilibria cannot be provided in general.

Remark 2. In the special case of a static environment, Definition 8 reduces to Esponda 
and Pouzo’s (2016) definition of Berk-Nash equilibrium for a single agent. In the 
dynamic environment, outcomes follow a Markov process and we must keep track not 
only of strategies but also of the corresponding stationary distribution over outcomes. 

The next result establishes existence of equilibrium in any regular SMDP. 

Theorem 1. For any regular SMDP, there exists a Berk-Nash equilibrium. 

Proof. See the Appendix. □ 

The standard approach to proving existence begins by defining a “best response 
correspondence” in the space of strategies. This approach does not work here because 
the possible non-uniqueness of beliefs implies that the correspondence may not be 
convex valued. The trick we employ is to define equilibrium via a correspondence on 
the space of strategies, stationary distributions, and beliefs, and then use Lemmas 1, 
2 and 3 to show that this correspondence satisfies the assumptions of a generalized 
version of Kakutani’s fixed point theorem. 11 

11 Esponda and Pouzo (2016) rely on perturbations to show existence of equilibrium in a static 
setting. In contrast, our approach does not require the use of perturbations. 
3.3 Correctly specified and identified SMDPs

An SMDP is correctly specified if the set of subjective models contains the true model. 


Definition 9. An SMDP($Q$, $\mathcal{Q}_\Theta$) is correctly specified if $Q \in \mathcal{Q}_\Theta$; otherwise, it is
misspecified.

In decision problems, data is endogenous and so, following Esponda and Pouzo
(2016), it is natural to consider two notions of identification: weak and strong identification.
These definitions distinguish between outcomes on and off the equilibrium
path. In a dynamic environment, the right object to describe what happens on and 
off the equilibrium path is not the strategy but rather the stationary distribution over 
outcomes m. 

Definition 10. An SMDP is weakly identified given $m \in \Delta(\mathrm{Gr}(\Gamma))$ if $\theta, \theta' \in
\Theta_Q(m)$ implies that $Q_\theta(\cdot \mid s, x) = Q_{\theta'}(\cdot \mid s, x)$ for all $(s, x) \in \mathrm{Gr}(\Gamma)$ such that
$m(s, x) > 0$; if the condition is satisfied for all $(s, x) \in \mathrm{Gr}(\Gamma)$, we say that the
SMDP is strongly identified given $m$. An SMDP is weakly (strongly) identified
if it is weakly (strongly) identified for all $m \in \Delta(\mathrm{Gr}(\Gamma))$.

Weak identification implies that, for any equilibrium distribution m, the agent 
has a unique belief along the equilibrium path, i.e., for states and actions that occur 
with positive probability. It is a condition that turns out to be important for proving 
the existence of equilibria that are robust to experimentation (see Section 6) and is 
always satisfied in correctly specified SMDPs. 12 Strong identification strengthens the 
condition by requiring that beliefs are unique also off the equilibrium path. 

Proposition 1. Consider a correctly specified and strongly identified SMDP with corresponding
MDP($Q$). A strategy and probability distribution $(\sigma, m) \in \Sigma \times \Delta(\mathrm{Gr}(\Gamma))$
is a Berk-Nash equilibrium of the SMDP if and only if $\sigma$ is optimal given MDP($Q$)
and $m$ is a stationary distribution given $\sigma$.

12 The following is an example where weak identification fails. Suppose an unbiased coin is tossed 
every period, but the agent believes that the coin comes up heads with probability 1/4 or 3/4, but 
not 1/2. Then both 1/4 and 3/4 minimize the Kullback-Leibler divergence, but they imply different 
distributions over outcomes. Relatedly, Berk (1966) shows that beliefs do not converge. 
Proof. Only if: Suppose $(\sigma, m)$ is a Berk-Nash equilibrium. Then there exists $\mu$ such
that $\sigma$ is optimal given MDP($\bar{Q}_\mu$), $\mu \in \Delta(\Theta_Q(m))$, and $m \in I_Q(\sigma)$. Because the
SMDP is correctly specified, there exists $\theta^*$ such that $Q_{\theta^*} = Q$ and, therefore, by
Lemma 3(i), $\theta^* \in \Theta_Q(m)$. Then, by strong identification, any $\theta \in \Theta_Q(m)$ satisfies
$Q_\theta = Q_{\theta^*} = Q$, implying that $\sigma$ is also optimal given MDP($Q$). If: Let $m \in I_Q(\sigma)$,
where $\sigma$ is optimal given MDP($Q$). Because the SMDP is correctly specified, there
exists $\theta^*$ such that $Q_{\theta^*} = Q$ and, therefore, by Lemma 3(i), $\theta^* \in \Theta_Q(m)$. Thus, $\sigma$
is also optimal given $Q_{\theta^*}$, implying that $(\sigma, m)$ is a Berk-Nash equilibrium. □

Proposition 1 says that, in environments where the agent is uncertain about the
transition probability function but her subjective model is both correctly specified
and strongly identified, Berk-Nash equilibrium corresponds to the solution of the
MDP under correct beliefs about the transition probability function. If one drops the
assumption that the SMDP is strongly identified, then the “if” part of the proposition 
continues to hold but the “only if” condition does not hold. In other words, there 
may be Berk-Nash equilibria of correctly-specified SMDPs in which the agent has 
incorrect beliefs off the equilibrium path. This feature of equilibrium is analogous to 
the main ideas of the bandit and self-confirming equilibrium literatures. 

4 Examples 

We use three classic examples to illustrate how easy it is to use our framework to 
expand the scope of the classical dynamic programming approach. 

4.1 Monopolist with unknown dynamic demand 

The problem of a monopolist facing an unknown, static demand function was first
studied by Rothschild (1974b) and Nyarko (1991) in correctly specified and misspecified
settings, respectively. In the following example, the monopolist faces a dynamic demand
function but incorrectly believes that demand is static.

MDP: In each period $t$, a monopolist chooses price $x_t \in \mathbb{X} = \{L, H\}$, where
$0 < L < H$. It then sells $s_{t+1} \in \mathbb{S} = \{0, 1\}$ units at zero cost and obtains profit
$\pi(x_t, s_{t+1}) = x_t s_{t+1}$. The probability that $s_{t+1} = 1$ is $q_{sx} = Q(1 \mid s_t = s, x_t = x)$,
where $0 < q_{sx} < 1$ for all $(s, x) \in \mathrm{Gr}(\Gamma) = \mathbb{S} \times \mathbb{X}$. 13 The monopolist wants to
maximize expected discounted profits, with discount factor $\delta \in [0, 1)$.

13 The set of feasible actions is independent of the state, i.e., $\Gamma(s) = \mathbb{X}$ for all $s \in \mathbb{S}$.

Demand is dynamic in the sense that a sale yesterday increases the probability of a sale today: $q_{1x} > q_{0x}$ for all $x \in \mathbb{X}$. Moreover, a higher price reduces the probability of a sale: $q_{sL} > q_{sH}$ for all $s \in \mathbb{S}$. Finally, for concreteness, we assume that

\[ \frac{q_{1L}}{q_{1H}} < \frac{H}{L} < \frac{q_{0L}}{q_{0H}}. \tag{5} \]

Expression (5) implies that current-period profits are maximized by choosing price $L$ if there was no sale last period and price $H$ otherwise (i.e., $L q_{0L} > H q_{0H}$ and $H q_{1H} > L q_{1L}$). Thus, the optimal strategy of a myopic monopolist (i.e., $\delta = 0$) who knows the primitives is $\sigma(H \mid 0) = 0$ and $\sigma(H \mid 1) = 1$. If, however, the monopolist is sufficiently patient, it is optimal to always choose price $L$.^14

SMDP. The monopolist does not know $Q$ and believes, incorrectly, that demand is not dynamic. Formally, $\mathcal{Q}_\Theta = \{Q_\theta : \theta \in \Theta\}$, where $\Theta = [0, 1]^2$ and, for all $\theta = (\theta_L, \theta_H) \in \Theta$, $Q_\theta(1 \mid s, L) = \theta_L$ and $Q_\theta(1 \mid s, H) = \theta_H$ for all $s \in \mathbb{S}$. In particular, $\theta_x$ is the probability that a sale occurs given price $x \in \{L, H\}$, and the agent believes that it does not depend on $s$. Note that this SMDP is regular. For simplicity, we restrict attention to equilibria in which the monopolist does not condition on last period's state, and denote a strategy by $\sigma_H$, the probability that price $H$ is chosen.

Equilibrium. Optimality. Because the monopolist believes that demand is static, the optimal strategy is to choose the price that maximizes current period's profit. Let

\[ \Delta(\theta) = H \theta_H - L \theta_L \]

denote the perceived expected payoff difference of choosing $H$ vs. $L$ under the belief that the parameter value is $\theta = (\theta_L, \theta_H)$ with probability 1. If $\Delta(\theta) > 0$, $\sigma_H = 1$ is the unique optimal strategy; if $\Delta(\theta) < 0$, $\sigma_H = 0$ is the unique optimal strategy; and if $\Delta(\theta) = 0$, any $\sigma_H \in [0, 1]$ is optimal.

Beliefs. For any $m \in \Delta(\mathbb{S} \times \mathbb{X})$, the wKLD simplifies to

\[ K_Q(m, \theta) = -\sum_{x \in \{L, H\}} m_{\mathbb{X}}(x) \left\{ \bar{s}_x(m) \ln \theta_x + (1 - \bar{s}_x(m)) \ln(1 - \theta_x) \right\} + \text{Const}, \]

where $\bar{s}_x(m) = m_{\mathbb{S}|\mathbb{X}}(0 \mid x) q_{0x} + m_{\mathbb{S}|\mathbb{X}}(1 \mid x) q_{1x}$ is the probability of a sale given $x$. If $\sigma_L > 0$ and $\sigma_H > 0$ (where $\sigma_L = 1 - \sigma_H$), $\theta_Q(m) = (\bar{s}_L(m), \bar{s}_H(m))$ is the unique parameter value that minimizes the wKLD function. If, however, one of the prices is chosen with zero probability, there are no restrictions on beliefs for the corresponding parameter, i.e., the set of minimizers is $\Theta_Q(m) = \{(\theta_L, \theta_H) \in \Theta : \theta_H = \bar{s}_H(m)\}$ if $\sigma_L = 0$ and $\Theta_Q(m) = \{(\theta_L, \theta_H) \in \Theta : \theta_L = \bar{s}_L(m)\}$ if $\sigma_H = 0$.

14 Formally, there exists $C_\delta \in [q_{1L}/q_{1H}, q_{0L}/q_{0H}]$, where $C_0 = q_{1L}/q_{1H}$ and $\delta \mapsto C_\delta$ is increasing, such that, if $H/L < C_\delta$, the optimal strategy is $\sigma(H \mid 0) = \sigma(H \mid 1) = 0$.

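Stationary distribution. Fix a strategy $\sigma_H$ and denote a corresponding stationary distribution by $m(\cdot; \sigma_H) \in \Delta(\mathbb{S} \times \mathbb{X})$. Since the strategy does not depend on the state, $m_{\mathbb{S}|\mathbb{X}}(\cdot \mid x; \sigma_H)$ does not depend on $x$ and, therefore, coincides with the marginal stationary distribution over $\mathbb{S}$, denoted by $m_{\mathbb{S}}(\cdot; \sigma_H) \in \Delta(\mathbb{S})$. This distribution is unique and given by the solution to

\[ m_{\mathbb{S}}(1; \sigma_H) = (1 - m_{\mathbb{S}}(1; \sigma_H)) \left( (1 - \sigma_H) q_{0L} + \sigma_H q_{0H} \right) + m_{\mathbb{S}}(1; \sigma_H) \left( (1 - \sigma_H) q_{1L} + \sigma_H q_{1H} \right). \]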
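Because the stationarity condition above is linear in $m_{\mathbb{S}}(1; \sigma_H)$, it can be solved in closed form. The block below is a worked rearrangement; the shorthand $a(\sigma_H)$ and $b(\sigma_H)$ is introduced here for convenience and does not appear in the original text.

```latex
% Shorthand (an assumption of this note, not the paper's notation):
%   a(\sigma_H) = probability of a sale after no sale,
%   b(\sigma_H) = probability of a sale after a sale.
\begin{align*}
a(\sigma_H) &= (1-\sigma_H)\,q_{0L} + \sigma_H\,q_{0H}, \qquad
b(\sigma_H) = (1-\sigma_H)\,q_{1L} + \sigma_H\,q_{1H}, \\
m_{\mathbb{S}}(1;\sigma_H) &= \bigl(1 - m_{\mathbb{S}}(1;\sigma_H)\bigr)\,a(\sigma_H)
   + m_{\mathbb{S}}(1;\sigma_H)\,b(\sigma_H) \\
\Longrightarrow\quad
m_{\mathbb{S}}(1;\sigma_H) &= \frac{a(\sigma_H)}{1 + a(\sigma_H) - b(\sigma_H)} .
\end{align*}
```

Since $0 < q_{sx} < 1$, the denominator is strictly positive, which is consistent with the uniqueness claim.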


Equilibrium. We restrict attention to equilibria that are robust to experimentation (i.e., perfect equilibria; see Section 6) by focusing on the belief $\theta(\sigma_H) = (\theta_L(\sigma_H), \theta_H(\sigma_H)) = \theta_Q(m(\cdot; \sigma_H))$ for a given strategy $\sigma_H \in [0, 1]$.^15 Next, let $\Delta(\theta(\sigma_H))$ be the perceived expected payoff difference for a given strategy $\sigma_H$. Note that $\sigma_H \mapsto \Delta(\theta(\sigma_H))$ is decreasing,^16 which means that a higher probability of choosing price $H$ leads to more pessimistic beliefs about the benefit of choosing $H$ vs. $L$. Therefore, there exists a unique (perfect) equilibrium strategy. Figure 1 depicts an example where the equilibrium is in mixed strategies.^17 Since $\Delta(\theta(0)) > 0$, an agent who always chooses a low price must believe in equilibrium that setting a high price would instead be optimal. Similarly, $\Delta(\theta(1)) < 0$ implies that an agent who always chooses a high price must believe in equilibrium that setting a low price would instead be optimal. Therefore, in equilibrium, the agent chooses a strictly mixed strategy $\sigma_H^* \in (0, 1)$ such that $\Delta(\theta(\sigma_H^*)) = 0$.^18

15 Both $\sigma_H = 0$ and $\sigma_H = 1$ are Berk-Nash equilibria supported by beliefs $\theta_H(0) = 0$ and $\theta_L(1) = 0$, respectively. These outcomes, however, are not robust to experimentation, and are eliminated by requiring $\theta_H(0) = \lim_{\sigma_H \to 0} \bar{s}_H(m(\cdot; \sigma_H)) = \bar{s}_H(m(\cdot; 0))$, and similarly for $\theta_L(1)$.

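16 The reason is that $\frac{d}{d\sigma_H} \Delta(\theta(\sigma_H)) = \frac{d}{d\sigma_H} m_{\mathbb{S}}(1; \sigma_H) \left( H(q_{1H} - q_{0H}) - L(q_{1L} - q_{0L}) \right) < 0$, since $\frac{d}{d\sigma_H} m_{\mathbb{S}}(1; \sigma_H) < 0$ and, by expression (5), $H(q_{1H} - q_{0H}) - L(q_{1L} - q_{0L}) = (H q_{1H} - L q_{1L}) + (L q_{0L} - H q_{0H}) > 0$.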

17 See Esponda and Pouzo (2016) for the importance of mixed strategies in misspecified settings.

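18 More generally, the unique equilibrium is $\sigma_H = 0$ if $\Delta(\theta(0)) < 0$ (i.e., $H/L < D_1 = \theta_L(0)/\theta_H(0)$), $\sigma_H = 1$ if $\Delta(\theta(1)) > 0$ (i.e., $H/L > D_2 = \theta_L(1)/\theta_H(1)$), and $\sigma_H^* \in (0, 1)$ the solution to $\Delta(\theta(\sigma_H)) = 0$ if $D_1 < H/L < D_2$, where $q_{1L}/q_{1H} < D_1 < D_2 < q_{0L}/q_{0H}$.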


Figure 1: Equilibrium of the monopoly example (the plotted curve is $\Delta(\theta(\cdot))$)

The misspecified monopolist may end up choosing higher prices than optimal, since she fails to realize that high prices today cost her in the future. But, a bit more surprisingly, she may also end up choosing lower prices for some primitives.^19 The reason is that her failure to realize that $H$ does relatively better in state $s = 1$ makes $H$ unattractive to her.
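The construction of the perfect equilibrium in this example can be traced numerically. The following is a minimal sketch, not the authors' code: the primitives in `Q_TRUE`, `LOW`, and `HIGH` are illustrative values chosen here to satisfy expression (5), and all function names are hypothetical. It computes the stationary sale probability, the wKLD-minimizing beliefs, and the strategy at which the perceived payoff difference vanishes.

```python
# Illustrative primitives (assumed values, chosen to satisfy expression (5)).
Q_TRUE = {(0, "L"): 0.4, (0, "H"): 0.1, (1, "L"): 0.8, (1, "H"): 0.6}
LOW, HIGH = 1.0, 2.0


def stationary_sale_prob(sigma_h, q=Q_TRUE):
    """Stationary probability m_S(1; sigma_H) that the previous period had a sale."""
    a = (1 - sigma_h) * q[(0, "L")] + sigma_h * q[(0, "H")]  # sale prob. after no sale
    b = (1 - sigma_h) * q[(1, "L")] + sigma_h * q[(1, "H")]  # sale prob. after a sale
    return a / (1 + a - b)


def beliefs(sigma_h, q=Q_TRUE):
    """wKLD-minimizing beliefs: theta_x is the average sale frequency at price x."""
    m1 = stationary_sale_prob(sigma_h, q)
    return {x: (1 - m1) * q[(0, x)] + m1 * q[(1, x)] for x in ("L", "H")}


def payoff_gap(sigma_h, q=Q_TRUE, low=LOW, high=HIGH):
    """Perceived gain Delta(theta(sigma_H)) from charging H rather than L."""
    theta = beliefs(sigma_h, q)
    return high * theta["H"] - low * theta["L"]


def equilibrium_sigma(q=Q_TRUE, low=LOW, high=HIGH, tol=1e-12):
    """Probability of price H in the perfect equilibrium, found by bisection."""
    if payoff_gap(0.0, q, low, high) <= 0:
        return 0.0
    if payoff_gap(1.0, q, low, high) >= 0:
        return 1.0
    lo, hi = 0.0, 1.0  # the gap is positive at lo and negative at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if payoff_gap(mid, q, low, high) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    sigma_star = equilibrium_sigma()
    print(sigma_star, beliefs(sigma_star), payoff_gap(sigma_star))
```

With these assumed numbers the gap is positive at $\sigma_H = 0$ and negative at $\sigma_H = 1$, so the bisection returns a strictly mixed strategy, as in the case depicted in Figure 1.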

4.2 Search with uncertainty about future job offers 

Search-theoretic models have been central to understanding labor markets since McCall (1970). Most of the literature assumes that the worker knows all the primitives. Exceptions include Rothschild (1974a) and Burdett and Vishwanath (1988), wherein the worker does not know the wage distribution but has a correctly specified model. In contrast, we study a worker or entrepreneur who knows the distribution of wages or returns for new projects but does not know the probability that she would be able to find a new job or fund a new project. The worker or entrepreneur, however, does not realize that she is fired or her project fails with higher probability in times in which it is actually harder to find a new job or fund a new project. We show that the worker or entrepreneur becomes pessimistic about the chances of finding new prospects and sub-optimally accepts prospects with low returns in equilibrium.

MDP. At the beginning of each period $t$, a worker (or entrepreneur) faces a wage offer (or a project with returns) $w_t \in \mathbb{S} = [0, 1]$ and decides whether to reject or accept it, $x_t \in \mathbb{X} = \{0, 1\}$.^20 Her payoff in period $t$ is $\pi(w_t, x_t) = w_t x_t$; i.e., she earns $w_t$ if

19 This happens if $C_\delta < H/L < D_1$; see footnotes 14 and 18.

20 The set of feasible actions is independent of the state, i.e., $\Gamma(w) = \mathbb{X}$ for all $w \in \mathbb{S}$.

she accepts and zero otherwise. After making her decision, an economic fundamental $z_t \in \mathbb{Z}$ is drawn from an i.i.d. distribution $G$.^21 If the worker is employed, she is fired (or the project fails) with probability $\gamma(z_t)$. If the worker is unemployed (either because she was employed and then fired or because she did not accept employment at the beginning of the period), then with probability $\lambda(z_t)$ she draws a new wage $w_{t+1} \in [0, 1]$ according to some absolutely continuous distribution $F$ with density $f$; wages are independent and identically distributed across time. With probability $1 - \lambda(z_t)$, the unemployed worker receives no wage offer, and we denote the corresponding state by $w_{t+1} = 0$ without loss of generality. The worker will have to decide whether to accept or reject $w_{t+1}$ at the beginning of next period. If the worker accepted employment at wage $w_t$ at the beginning of time $t$ and was not fired, then she starts next period with wage offer $w_{t+1} = w_t$ and will again have to decide whether to quit or remain in her job at that offer.^22 The agent wants to maximize discounted expected utility with discount factor $\delta \in [0, 1)$. Suppose that $\gamma \equiv E[\gamma(Z)] > 0$ and $\lambda \equiv E[\lambda(Z)] > 0$.

We assume that $Cov(\gamma(Z), \lambda(Z)) < 0$; for example, the worker is more likely to get fired and less likely to receive an offer when economic fundamentals are strong, and the opposite holds when fundamentals are weak.

SMDP. The worker knows all the primitives except $\lambda(\cdot)$, which determines the probability of receiving an offer. The worker has a misspecified model of the world and believes $\lambda(\cdot)$ does not depend on the economic fundamental, i.e., $\lambda(z) = \theta$ for all $z \in \mathbb{Z}$, where $\theta \in [0, 1]$ is the unknown parameter.^23 The transition probability function $Q_\theta(w' \mid w, x)$ is as follows: If $x = 1$, then $w' = w$ with probability $1 - \gamma$, $w'$ is a draw from $F$ with probability $\theta\gamma$, and $w' = 0$ with probability $(1 - \theta)\gamma$; if $x = 0$, then $w'$ is a draw from $F$ with probability $\theta$ and $w' = 0$ with probability $1 - \theta$.

Equilibrium. Optimality. Suppose that the worker believes that the true parameter is $\theta$ with probability 1. The value of receiving wage offer $w \in \mathbb{S}$ is

\[ V(w) = \max \Bigl\{ w + \delta \bigl( (1 - \gamma) V(w) + (1 - \theta)\gamma V(0) + \theta\gamma E[V(W')] \bigr), \; 0 + \delta \bigl( \theta E[V(W')] + (1 - \theta) V(0) \bigr) \Bigr\}. \]

21 To simplify the notation, we assume the fundamental is unobserved, although the results are identical if it is observed, since it is i.i.d. and it is realized after the worker makes her decision.

22 Formally, $Q(w' \mid w, x)$ is as follows: If $x = 1$, then $w' = w$ with probability $1 - \gamma$, $w'$ is a draw from $F$ with probability $E[\gamma(Z)\lambda(Z)]$, and $w' = 0$ with probability $E[\gamma(Z)(1 - \lambda(Z))]$; if $x = 0$, then $w'$ is a draw from $F$ with probability $\lambda$ and $w' = 0$ with probability $1 - \lambda$.

23 The results are identical if the agent is also uncertain of $\gamma(\cdot)$; given the current misspecification, the agent only cares about the expectation of $\gamma$ and will have correct beliefs about it.

By standard arguments, her optimal strategy is a stationary reservation wage strategy $w(\theta)$ that solves the following equation:

\[ w(\theta)(1 - \delta + \delta\gamma) = \delta\theta(1 - \gamma) \int_{w > w(\theta)} (w - w(\theta)) \, F(dw). \tag{6} \]

The worker accepts wages above the reservation wage and rejects wages below it. Also, $\theta \mapsto w(\theta)$ is increasing: the higher the probability of receiving a wage offer, the more she is willing to wait for a better offer in the future. Figure 2 depicts an example.
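Equation (6) is straightforward to solve numerically once a wage distribution is fixed. The sketch below is illustrative only: it assumes that $F$ is uniform on $[0, 1]$ (the paper only requires $F$ to be absolutely continuous), in which case the integral in (6) equals $(1 - w)^2 / 2$, and the parameter values and function name are hypothetical.

```python
def reservation_wage(theta, delta=0.9, gamma=0.2, tol=1e-12):
    """Solve equation (6) for w(theta) by bisection, assuming F is uniform on [0, 1].

    Under the uniform assumption, the integral of (w - wbar) over w > wbar
    equals (1 - wbar)**2 / 2.
    """
    lhs_coef = 1 - delta + delta * gamma
    rhs_coef = delta * theta * (1 - gamma)

    def residual(wbar):
        # Left-hand side minus right-hand side of (6); increasing in wbar.
        return wbar * lhs_coef - rhs_coef * (1 - wbar) ** 2 / 2

    lo, hi = 0.0, 1.0  # residual(0) <= 0 and residual(1) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(theta, reservation_wage(theta))
```

Running the loop shows the reservation wage rising with $\theta$, which is the monotonicity used later to locate the equilibrium.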

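Beliefs. For any $m \in \Delta(\mathbb{S} \times \mathbb{X})$, the wKLD simplifies to

\[ K_Q(m, \theta) = \int E_{Q(\cdot \mid w, x)} \left[ \ln \frac{Q(W' \mid w, x)}{Q_\theta(W' \mid w, x)} \right] m(dw, dx) \]
\[ = m_{\mathbb{X}}(0) \left\{ \lambda \ln \frac{\lambda}{\theta} + (1 - \lambda) \ln \frac{1 - \lambda}{1 - \theta} \right\} + m_{\mathbb{X}}(1) \left\{ E[\gamma\lambda] \ln \frac{E[\gamma\lambda]}{\gamma\theta} + E[\gamma(1 - \lambda)] \ln \frac{E[\gamma(1 - \lambda)]}{\gamma(1 - \theta)} \right\}, \]

where the density of $W'$ cancels out because the worker knows it and where $m_{\mathbb{X}}$ is the marginal distribution over $\mathbb{X}$. In the Online Appendix, we show that the unique parameter that minimizes $K_Q(m, \cdot)$ is

\[ \theta_Q(m) = \frac{m_{\mathbb{X}}(0)}{m_{\mathbb{X}}(0) + m_{\mathbb{X}}(1)\gamma} \, \lambda + \left( 1 - \frac{m_{\mathbb{X}}(0)}{m_{\mathbb{X}}(0) + m_{\mathbb{X}}(1)\gamma} \right) \left( \lambda + \frac{Cov(\gamma, \lambda)}{\gamma} \right). \tag{7} \]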

To see the intuition behind equation (7), note that the agent only observes the realization of $\lambda$, i.e., whether she receives a wage offer, when she is unemployed. Unemployment can be voluntary or involuntary. In the first case, the agent rejects the offer and, since this decision happens before the fundamental is realized, it is independent of whether or not she gets an offer. Thus, conditional on unemployment being voluntary, the agent will observe an unbiased average probability of getting an offer, $\lambda$ (see the first term in the RHS of (7)). In the second case, the agent accepts the offer but is then fired. Since $Cov(\gamma, \lambda) < 0$, she is less likely to get an offer in periods in which she is fired and, because she does not account for this correlation, she will have a more pessimistic view about the probability of receiving a wage offer relative to the average probability $\lambda$ (the second term in the RHS of (7) captures this bias).
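Equation (7) can be checked against a direct numerical minimization of the $\theta$-dependent part of the wKLD. The sketch below is a sanity check under stated assumptions, not part of the paper: the two-point fundamental in `FUNDAMENTAL` and the marginal `M_REJECT`, `M_ACCEPT` are made-up numbers with $Cov(\gamma, \lambda) < 0$, and the ternary search is just one convenient way to minimize a convex function.

```python
import math

# Assumed two-point fundamental: (probability, gamma(z), lambda(z)).
FUNDAMENTAL = [(0.5, 0.3, 0.2), (0.5, 0.1, 0.6)]
# Assumed marginal over actions: m_X(0) (reject) and m_X(1) (accept).
M_REJECT, M_ACCEPT = 0.3, 0.7

gamma_bar = sum(p * g for p, g, _ in FUNDAMENTAL)        # E[gamma(Z)]
lambda_bar = sum(p * l for p, _, l in FUNDAMENTAL)       # E[lambda(Z)]
e_gamma_lambda = sum(p * g * l for p, g, l in FUNDAMENTAL)
cov = e_gamma_lambda - gamma_bar * lambda_bar            # negative here


def wkld_theta_part(theta):
    """Part of the wKLD that depends on theta (constants dropped)."""
    offer = M_REJECT * lambda_bar + M_ACCEPT * e_gamma_lambda
    no_offer = M_REJECT * (1 - lambda_bar) + M_ACCEPT * (gamma_bar - e_gamma_lambda)
    return -(offer * math.log(theta) + no_offer * math.log(1 - theta))


def minimize(f, lo=1e-9, hi=1 - 1e-9, iters=200):
    """Ternary search for the minimizer of a strictly convex function on (lo, hi)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def theta_formula():
    """Closed form from equation (7)."""
    weight = M_REJECT / (M_REJECT + M_ACCEPT * gamma_bar)
    return weight * lambda_bar + (1 - weight) * (lambda_bar + cov / gamma_bar)


if __name__ == "__main__":
    print(minimize(wkld_theta_part), theta_formula(), lambda_bar)
```

Under these assumed numbers the minimizer lies below the true average arrival rate, in line with the pessimism described above.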
Figure 2: Equilibrium of the search model

Stationary distribution. Fix a reservation wage strategy $w$ and denote the marginal over $\mathbb{X}$ of the corresponding stationary distribution by $m_{\mathbb{X}}(\cdot; w) \in \Delta(\mathbb{X})$. In the Online Appendix, we characterize $m_{\mathbb{X}}(\cdot; w)$ and show that $w \mapsto m_{\mathbb{X}}(0; w)$ is increasing. Intuitively, the more selective the worker, the higher the chance of being unemployed.

Equilibrium. Let $\theta(w) = \theta_Q(m(\cdot; w))$ denote the equilibrium belief for an agent following reservation wage strategy $w$. The weight on $\lambda$ in equation (7) represents the probability of voluntary unemployment conditional on unemployment. This weight is increasing in $w$ because $w \mapsto m_{\mathbb{X}}(0; w)$ is increasing. Therefore, $w \mapsto \theta(w)$ is increasing. In the extreme case in which $w = 1$, the worker rejects all offers, unemployment is always voluntary, and the bias disappears, $\theta(1) = \lambda$. An example of the schedule $\theta(\cdot)$ is depicted in Figure 2. The set of Berk-Nash equilibria is given by the intersection of $w(\cdot)$ and $\theta(\cdot)$. In the example depicted in Figure 2, there is a unique equilibrium strategy $w^M = w(\theta^M)$, where $\theta^M < \lambda$.

We conclude by comparing Berk-Nash equilibria to the optimal strategy of a worker who knows the primitives, $w^*$. By standard arguments, $w^*$ is the unique solution to

\[ w^*(1 - \delta + \delta\gamma) = \delta(\lambda - E[\gamma\lambda]) \int_{w > w^*} (w - w^*) \, F(dw). \tag{8} \]

The only difference between equations (6) and (8) appears in the term multiplying the RHS, which captures the cost of accepting a wage offer. In the misspecified case, this term is $\delta\theta(1 - \gamma)$; in the correct case, it is $\delta(\lambda - E[\gamma\lambda]) = \delta\lambda(1 - \gamma) - \delta Cov(\gamma, \lambda)$. The misspecification affects the optimal threshold in two ways. First, the misspecified agent estimates the mean of $\lambda$ incorrectly, i.e., $\theta < \lambda$; therefore, she (incorrectly) believes that, in expectation, offers arrive with lower probability. Second, she does not realize that, because $Cov(\gamma, \lambda) < 0$, she is less likely to receive an offer when fired. Both effects go in the same direction and make the option to reject and wait for the possibility of drawing a new wage offer next period less attractive for the misspecified worker. Formally, $\theta\delta(1 - \gamma) < \delta\lambda(1 - \gamma) - \delta Cov(\gamma, \lambda)$ and so $w^M < w^*$.
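The two effects can be separated in one short chain of inequalities. The block below only spells out the algebra behind the comparison of (6) and (8), using the definition of covariance and the facts $\theta^M < \lambda$ and $Cov(\gamma, \lambda) < 0$ stated in the text.

```latex
\begin{align*}
E[\gamma\lambda] &= \gamma\lambda + Cov(\gamma,\lambda)
 \;\Longrightarrow\;
 \delta\bigl(\lambda - E[\gamma\lambda]\bigr)
   = \delta\lambda(1-\gamma) - \delta\,Cov(\gamma,\lambda), \\
\underbrace{\delta\,\theta^{M}(1-\gamma)}_{\text{misspecified, eq. (6)}}
 &< \underbrace{\delta\lambda(1-\gamma)}_{\text{first effect: } \theta^{M} < \lambda}
 < \underbrace{\delta\lambda(1-\gamma) - \delta\,Cov(\gamma,\lambda)}_{\text{second effect: } Cov(\gamma,\lambda) < 0,\ \text{eq. (8)}} .
\end{align*}
```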

4.3 Stochastic growth with correlated shocks 

Stochastic growth models have been central to studying optimal intertemporal allocation of capital and consumption since the work of Brock and Mirman (1972). Freixas (1981) and Koulovatianos et al. (2009) assume that agents learn the distribution over productivity shocks with correctly specified models. We follow Hall (1997) and subsequent literature in incorporating shocks to both preferences and productivity, but assume that these shocks are (positively) correlated. We show that agents who fail to account for the correlation of shocks underinvest in equilibrium.

MDP. In each period $t$, an agent observes $s_t = (y_t, z_t) \in \mathbb{S} = \mathbb{R}_+ \times \{L, H\}$, where $y_t$ is income from the previous period and $z_t$ is a current utility shock, and chooses how much income to save, $x_t \in \Gamma(y_t, z_t) = [0, y_t] \subseteq \mathbb{X} = \mathbb{R}_+$, consuming the rest. Current period utility is $\pi(y_t, z_t, x_t) = z_t \ln(y_t - x_t)$. Income next period, $y_{t+1}$, is given by

\[ \ln y_{t+1} = \alpha^* + \beta^* \ln x_t + \varepsilon_t, \tag{9} \]

where $\varepsilon_t = \gamma^* z_t + \xi_t$ is an unobserved productivity shock, $\xi_t \sim N(0, 1)$, and $0 < \delta\beta^* < 1$, where $\delta \in [0, 1)$ is the discount factor. We assume that $\gamma^* > 0$, so that the utility and productivity shocks are positively correlated. Let $0 < L < H$ and let $q \in (0, 1)$ be the probability that the shock is $H$.^24

SMDP. The agent believes that

\[ \ln y_{t+1} = \alpha + \beta \ln x_t + \varepsilon_t, \tag{10} \]

where $\varepsilon_t \sim N(0, 1)$ and is independent of the utility shock. For simplicity, we assume that the agent knows the distribution of the utility shock, and is uncertain about $\theta = (\alpha, \beta) \in \Theta = \mathbb{R}^2$. The subjective transition probability function $Q_\theta(y', z' \mid y, z, x)$ is such that $y'$ and $z'$ are independent, $y'$ has a log-normal distribution with mean $\alpha + \beta \ln x$ and unit variance, and $z' = H$ with probability $q$. The agent has a misspecified model because she believes that the productivity and utility shocks are independent when in fact $\gamma^* \neq 0$.

24 Formally, $Q(y', z' \mid y, z, x)$ is such that $y'$ and $z'$ are independent, $y'$ has a log-normal distribution with mean $\alpha^* + \beta^* \ln x + \gamma^* z$ and unit variance, and $z' = H$ with probability $q$.

Equilibrium. Optimality. The Bellman equation for the agent is

\[ V(y, z) = \max_{0 \le x \le y} \; z \ln(y - x) + \delta E[V(Y', Z') \mid x], \]

and it is straightforward to verify that the optimal strategy is to invest a fraction of income that depends on the utility shock and the unknown parameter $\beta$, i.e., $x = A_z(\beta) \cdot y$, where

\[ A_L(\beta) = \frac{\delta\beta E[Z]}{(1 - \delta\beta) L + \delta\beta E[Z]} \quad \text{and} \quad A_H(\beta) = \frac{\delta\beta E[Z]}{(1 - \delta\beta) H + \delta\beta E[Z]} < A_L(\beta), \qquad E[Z] = qH + (1 - q)L. \]

For the agent who knows the primitives, the optimal strategy is to invest fractions $A_L(\beta^*)$ and $A_H(\beta^*)$ in the low and high state, respectively. Since $\beta \mapsto A_z(\beta)$ is increasing, the equilibrium strategy of a misspecified agent can be compared to the optimal strategy by comparing the equilibrium belief about $\beta$ with the true $\beta^*$.

Beliefs and stationary distribution. Let $A = (A_L, A_H)$, with $A_H < A_L$, represent a strategy, where $A_z$ is the proportion of income invested given utility shock $z$. Because the agent believes that $\varepsilon_t$ is independent of the utility shock and normally distributed, minimizing the wKLD function is equivalent to performing an OLS regression of equation (10). Thus, for a strategy represented by $A = (A_L, A_H)$, the parameter value $\beta(A)$ that minimizes wKLD is

\[ \beta(A) = \frac{Cov(\ln Y', \ln X)}{Var(\ln X)} = \frac{Cov(\ln Y', \ln A_Z Y)}{Var(\ln A_Z Y)} = \beta^* + \gamma^* \frac{Cov(Z, \ln A_Z)}{Var(\ln A_Z) + Var(\ln Y)}, \]

where Cov and Var are taken with respect to the (true) stationary distribution of $(Y, Z)$. Since $A_H < A_L$, then $Cov(Z, \ln A_Z) < 0$. Therefore, the assumption that $\gamma^* > 0$ implies that the bias $\beta(A) - \beta^*$ is negative and its magnitude depends on the strategy $A$. Intuitively, the agent invests a larger fraction of income when $z$ is low, which happens to be during times when $\varepsilon$ is also low.
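The omitted-variable bias in the OLS slope can be illustrated by simulation. The following sketch is a rough check under assumed numbers: the primitives, the strategy `FRACTION`, and all function names are hypothetical, and $\beta^* < 1$ is imposed so that log income has a stationary distribution. It simulates the true process (9) under a fixed strategy, regresses $\ln Y'$ on $\ln X$, and compares the slope with the bias expression, using the stationary variance of $\ln Y$ implied by (9).

```python
import math
import random

# Assumed true primitives and an arbitrary strategy with A_H < A_L.
BETA_STAR, GAMMA_STAR, ALPHA_STAR = 0.5, 0.5, 0.0
SHOCK = {"L": 1.0, "H": 2.0}
Q_HIGH = 0.5
FRACTION = {"L": 0.6, "H": 0.3}


def simulated_ols_slope(n=200_000, burn_in=1_000, seed=0):
    """OLS slope of ln Y' on ln X along one simulated path of the true process."""
    rng = random.Random(seed)
    log_y = 0.0
    sx = sy = sxx = sxy = 0.0
    for t in range(n + burn_in):
        z = "H" if rng.random() < Q_HIGH else "L"
        log_x = math.log(FRACTION[z]) + log_y  # x = A_z * y
        log_y_next = (ALPHA_STAR + BETA_STAR * log_x
                      + GAMMA_STAR * SHOCK[z] + rng.gauss(0.0, 1.0))
        if t >= burn_in:
            sx += log_x
            sy += log_y_next
            sxx += log_x * log_x
            sxy += log_x * log_y_next
        log_y = log_y_next
    mean_x, mean_y = sx / n, sy / n
    return (sxy / n - mean_x * mean_y) / (sxx / n - mean_x ** 2)


def predicted_slope():
    """beta* + gamma* Cov(Z, ln A_Z) / (Var(ln A_Z) + Var(ln Y))."""
    q = Q_HIGH
    d_log_a = math.log(FRACTION["H"]) - math.log(FRACTION["L"])
    d_z = SHOCK["H"] - SHOCK["L"]
    cov_z_a = q * (1 - q) * d_z * d_log_a
    var_a = q * (1 - q) * d_log_a ** 2
    # Stationary variance of ln Y implied by the AR(1) structure of (9).
    var_log_y = (q * (1 - q) * (BETA_STAR * d_log_a + GAMMA_STAR * d_z) ** 2 + 1.0) \
        / (1 - BETA_STAR ** 2)
    return BETA_STAR + GAMMA_STAR * cov_z_a / (var_a + var_log_y)


if __name__ == "__main__":
    print(simulated_ols_slope(), predicted_slope(), BETA_STAR)
```

Both numbers fall below the assumed $\beta^*$, which is the downward bias described above.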
Equilibrium. We establish that there exists at least one equilibrium with positive investment by showing that there is at least one fixed point of the function $\beta \mapsto \beta(A_L(\beta), A_H(\beta))$.^25 The function is continuous in $\beta$ and satisfies $\beta(A_L(0), A_H(0)) = \beta(A_L(1/\delta), A_H(1/\delta)) = \beta^*$ and $\beta(A_L(\beta), A_H(\beta)) < \beta^*$ for all $\beta \in (0, 1/\delta)$. Then, since $\delta\beta^* < 1$, there is at least one fixed point $\beta^M$, and any fixed point satisfies $\beta^M \in (0, \beta^*)$. Thus, the misspecified agent underinvests in equilibrium compared to the optimal strategy.^26 The conclusion is reversed if $\gamma^* < 0$, illustrating how the framework provides predictions about beliefs and behavior that depend on the primitives (as opposed to simply postulating that the agent is over- or under-confident about productivity).
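A fixed point of the belief map can be located numerically. The sketch below rests on two assumptions that are not taken from the text: the investment fractions are taken to be $A_z(\beta) = \delta\beta E[Z] / ((1 - \delta\beta) z + \delta\beta E[Z])$, which is what a log-linear guess for the value function delivers, and $\beta^* < 1$ so that $\ln Y$ is stationary. Parameter values and function names are hypothetical.

```python
import math

# Assumed primitives (illustrative values only).
DELTA, BETA_STAR, GAMMA_STAR = 0.9, 0.5, 0.3
LOW, HIGH, Q_HIGH = 1.0, 2.0, 0.5


def invest_fraction(z, beta):
    """Assumed form of A_z(beta), from a log-linear guess for the value function."""
    mean_z = Q_HIGH * HIGH + (1 - Q_HIGH) * LOW
    return DELTA * beta * mean_z / ((1 - DELTA * beta) * z + DELTA * beta * mean_z)


def belief_map(beta):
    """OLS belief beta(A_L(beta), A_H(beta)) under the true stationary distribution."""
    q = Q_HIGH
    d_log_a = math.log(invest_fraction(HIGH, beta)) - math.log(invest_fraction(LOW, beta))
    d_z = HIGH - LOW
    cov_z_a = q * (1 - q) * d_z * d_log_a
    var_a = q * (1 - q) * d_log_a ** 2
    var_log_y = (q * (1 - q) * (BETA_STAR * d_log_a + GAMMA_STAR * d_z) ** 2 + 1.0) \
        / (1 - BETA_STAR ** 2)
    return BETA_STAR + GAMMA_STAR * cov_z_a / (var_a + var_log_y)


def fixed_point(lo=1e-6, hi=BETA_STAR, tol=1e-12):
    """Bisection on belief_map(beta) - beta, positive at lo and negative at hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if belief_map(mid) - mid > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    beta_m = fixed_point()
    print(beta_m, invest_fraction(LOW, beta_m), invest_fraction(LOW, BETA_STAR))
```

With these numbers the fixed point lies strictly between zero and the assumed $\beta^*$, so the implied investment fractions are below those of an agent who knows the primitives.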


5 Equilibrium foundation 

In this section, we provide a learning foundation for the notion of Berk-Nash equilibrium of SMDPs. We fix an SMDP and assume that the agent is Bayesian and starts with a prior $\mu_0 \in \Delta(\Theta)$ over her set of models of the world. She observes past actions and states and uses this information to update her beliefs about $\Theta$ in every period.


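Definition 11. For any $(s, x, s') \in Gr(\Gamma) \times \mathbb{S}$, let $B(s, x, s', \cdot) : D_{s,x,s'} \to \Delta(\Theta)$ denote the Bayesian operator: For all $A \subseteq \Theta$ Borel,

\[ B(s, x, s', \mu)(A) = \frac{\int_A Q_\theta(s' \mid s, x) \, \mu(d\theta)}{\int_\Theta Q_\theta(s' \mid s, x) \, \mu(d\theta)}, \tag{11} \]

for any $\mu \in D_{s,x,s'}$, where $D_{s,x,s'} = \{ p \in \Delta(\Theta) : \int_\Theta Q_\theta(s' \mid s, x) \, p(d\theta) > 0 \}$.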


Definition 12. A Bayesian Subjective Markov Decision Process (Bayesian-SMDP) is an SMDP($Q$, $\mathcal{Q}_\Theta$) together with a prior $\mu_0 \in \Delta(\Theta)$ and the Bayesian operator $B$ (see Definition 11). It is said to be regular if the corresponding SMDP is regular.
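For a finite parameter set, the Bayesian operator in (11) reduces to a weighted renormalization of the prior. The sketch below is a minimal illustration under that finiteness assumption (the paper allows general $\Theta$); the function name and the numbers in the usage example are hypothetical.

```python
def bayes_update(prior, likelihood):
    """Bayesian operator B(s, x, s', mu) for a finite parameter set.

    prior:      dict mapping each theta to mu(theta)
    likelihood: dict mapping each theta to Q_theta(s' | s, x)
    """
    evidence = sum(prior[t] * likelihood[t] for t in prior)
    if evidence <= 0:
        # mu lies outside D_{s,x,s'}: the observation has zero subjective probability.
        raise ValueError("belief assigns zero probability to the observed transition")
    return {t: prior[t] * likelihood[t] / evidence for t in prior}


if __name__ == "__main__":
    # Three candidate sale probabilities, uniform prior, and one observed sale.
    prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
    sale_likelihood = {theta: theta for theta in prior}
    print(bayes_update(prior, sale_likelihood))
```

The explicit error for zero evidence mirrors the restriction of $B$ to the domain $D_{s,x,s'}$ in Definition 11.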

25 Our existence theorem is not directly applicable because we have assumed, for convenience, nonfinite state and action spaces.

26 It is also an equilibrium not to invest, $A = (0, 0)$, supported by the belief $\beta^* = 0$, which cannot be disconfirmed since investment does not take place. But this equilibrium is not robust to experimentation (i.e., it is not perfect; see Section 6).

By the Principle of Optimality, the agent's problem in a Bayesian-SMDP can be cast recursively as

\[ W(s, \mu) = \max_{x \in \Gamma(s)} \int_{\mathbb{S}} \left\{ \pi(s, x, s') + \delta W(s', \mu') \right\} \bar{Q}_\mu(ds' \mid s, x), \tag{12} \]

where $\bar{Q}_\mu = \int_\Theta Q_\theta \, \mu(d\theta)$, $\mu' = B(s, x, s', \mu)$ is next period's belief, updated using Bayes' rule, and $W : \mathbb{S} \times \Delta(\Theta) \to \mathbb{R}$ is the (unique) solution to the Bellman equation (12). Compared to the case where the agent knows the transition probability function, the agent's belief about $\Theta$ is now part of the state space.

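Definition 13. A policy function is a function $f : \Delta(\Theta) \to \Sigma$ mapping beliefs into strategies (recall that a strategy is a mapping $\sigma : \mathbb{S} \to \Delta(\mathbb{X})$). For any belief $\mu \in \Delta(\Theta)$, state $s \in \mathbb{S}$, and action $x \in \mathbb{X}$, let $f(x \mid s, \mu)$ denote the probability that the agent chooses $x$ when selecting policy function $f$. A policy function $f$ is optimal for the Bayesian-SMDP if, for all $s \in \mathbb{S}$, $\mu \in \Delta(\Theta)$, and $x \in \mathbb{X}$ such that $f(x \mid s, \mu) > 0$,

\[ x \in \arg\max_{\hat{x} \in \Gamma(s)} \int_{\mathbb{S}} \left\{ \pi(s, \hat{x}, s') + \delta W(s', \mu') \right\} \bar{Q}_\mu(ds' \mid s, \hat{x}). \]

For each $\mu \in \Delta(\Theta)$, let $\Sigma(\mu) \subseteq \Sigma$ denote the set of all strategies that are induced by a policy that is optimal, i.e.,

\[ \Sigma(\mu) = \{ \sigma \in \Sigma : \exists \text{ optimal } f \text{ such that } \sigma(\cdot \mid s) = f(\cdot \mid s, \mu) \text{ for all } s \in \mathbb{S} \}. \]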


Lemma 4. (i) There is a unique solution $W$ to the Bellman equation in (12), and it is continuous in $\mu$ for all $s \in \mathbb{S}$; (ii) The correspondence of optimal strategies $\mu \mapsto \Sigma(\mu)$ is non-empty, compact-valued, convex-valued, and upper hemicontinuous.

Proof. The proof is standard and relegated to the Online Appendix. □ 

Let $h^\infty = (s_0, x_0, \ldots, s_t, x_t, \ldots)$ represent the infinite history or outcome path of the dynamic optimization problem and let $\mathbb{H}^\infty = (Gr(\Gamma))^\infty$ represent the space of infinite histories. For every $t$, let $\mu_t : \mathbb{H}^\infty \to \Delta(\Theta)$ denote the agent's Bayesian beliefs, defined recursively by $\mu_t = B(s_{t-1}, x_{t-1}, s_t, \mu_{t-1})$ whenever $\mu_{t-1} \in D_{s_{t-1}, x_{t-1}, s_t}$ (see Definition 11), and arbitrary otherwise. We assume that the agent follows some policy function $f$. In each period $t$, there is a state $s_t$ and a belief $\mu_t$, and the agent chooses a (possibly mixed) action $f(\cdot \mid s_t, \mu_t) \in \Delta(\mathbb{X})$. After an action $x_t$ is realized, the state $s_{t+1}$ is drawn from the true transition probability. The agent observes the realized action and the new state and updates her beliefs to $\mu_{t+1}$ using Bayes' rule. The primitives of the Bayesian-SMDP (including the initial distribution over states, $q_0$, and the prior, $\mu_0 \in \Delta(\Theta)$) and a policy function $f$ induce a probability distribution over $\mathbb{H}^\infty$ that is defined in a standard way; let $P^f$ denote this probability distribution over $\mathbb{H}^\infty$.

We now define strategies and outcomes as random variables. For a fixed policy function $f$ and for every $t$, let $\sigma_t : \mathbb{H}^\infty \to \Sigma$ denote the strategy of the agent, defined by setting

\[ \sigma_t(h^\infty) = f(\cdot \mid \cdot, \mu_t(h^\infty)) \in \Sigma. \]

Finally, for every $t$, let $m_t : \mathbb{H}^\infty \to \Delta(Gr(\Gamma))$ be such that, for all $t$, $h^\infty$, and $(s, x) \in Gr(\Gamma)$,

\[ m_t(s, x \mid h^\infty) = \frac{1}{t} \sum_{\tau = 0}^{t} 1_{(s, x)}(s_\tau, x_\tau) \]

is the frequency of times that the outcome $(s, x)$ occurs up to time $t$.

One reasonable criterion for claiming that the agent has reached a steady state is that her strategy and the time average of outcomes converge.

Definition 14. A strategy and probability distribution $(\sigma, m) \in \Sigma \times \Delta(Gr(\Gamma))$ is stable for a Bayesian-SMDP with prior $\mu_0$ and policy function $f$ if there is a set $\mathcal{H} \subseteq \mathbb{H}^\infty$ with $P^f(\mathcal{H}) > 0$ such that, for all $h^\infty \in \mathcal{H}$, as $t \to \infty$,

\[ \sigma_t(h^\infty) \to \sigma \quad \text{and} \quad m_t(h^\infty) \to m. \tag{13} \]

If, in addition, there exists a belief $\mu^*$ and a subsequence $(\mu_{t(j)})_j$ such that

\[ \mu_{t(j)}(h^\infty) \to \mu^* \tag{14} \]

and, for all $(s, x) \in Gr(\Gamma)$, $\mu^* = B(s, x, s', \mu^*)$ for all $s' \in \mathbb{S}$ such that $\bar{Q}_{\mu^*}(s' \mid s, x) > 0$, then $(\sigma, m)$ is called stable with exhaustive learning.

Condition (13) requires that strategies and the time frequency of outcomes stabilize. By compactness, there exists a subsequence of beliefs that converges. The additional requirement of exhaustive learning says that the limit point of one of the subsequences, $\mu^*$, is perceived to be a fixed point of the Bayesian operator, implying that no matter what state and strategy the agent contemplates, she does not expect her belief to change. Thus, the agent believes that all learning possibilities are exhausted under $\mu^*$. The condition, however, does not imply that the agent has correct beliefs in steady state.

The next result establishes that, if the time average of outcomes stabilizes to $m$, then beliefs become increasingly concentrated on $\Theta_Q(m)$.

Lemma 5. Consider a regular Bayesian-SMDP with true transition probability function $Q$, full-support prior $\mu_0 \in \Delta(\Theta)$, and policy function $f$. Suppose that $(m_t)_t$ converges to $m$ for all histories in a set $\mathcal{H} \subseteq \mathbb{H}^\infty$ such that $P^f(\mathcal{H}) > 0$. Then, for all open sets $U \supseteq \Theta_Q(m)$,

\[ \lim_{t \to \infty} \mu_t(U) = 1 \]

$P^f$-a.s. in $\mathcal{H}$.

Proof. See the Appendix. □

The proof of Lemma 5 clarifies the origin of the wKLD function in the definition of Berk-Nash equilibrium. The proof adapts the proof of Lemma 2 by Esponda and Pouzo (2016) to dynamic environments. Lemma 5 extends results from the statistics of misspecified learning (Berk (1966), Bunke and Milhaud (1998), Shalizi (2009)) by considering a setting where agents learn from data that is endogenously generated by their own actions in a Markovian setting.
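The concentration of beliefs described in Lemma 5 can be visualized in the monopoly example of Section 4.1. The sketch below is a simulation under assumed numbers, not a proof: the true sale probabilities, the fixed strategy `SIGMA_H`, and the independent uniform (Beta(1, 1)) priors on $\theta_L$ and $\theta_H$ are all choices made here. Because the agent's model is a static Bernoulli model, the posterior for each $\theta_x$ stays in the Beta family, so only sale counts need to be tracked.

```python
import random

# Assumed true dynamic demand (satisfies q_1x > q_0x and q_sL > q_sH).
Q_TRUE = {(0, "L"): 0.4, (0, "H"): 0.1, (1, "L"): 0.8, (1, "H"): 0.6}
SIGMA_H = 0.5  # fixed, state-independent probability of charging H


def predicted_beliefs(sigma_h=SIGMA_H):
    """wKLD minimizers: long-run sale frequency at each price."""
    a = (1 - sigma_h) * Q_TRUE[(0, "L")] + sigma_h * Q_TRUE[(0, "H")]
    b = (1 - sigma_h) * Q_TRUE[(1, "L")] + sigma_h * Q_TRUE[(1, "H")]
    m1 = a / (1 + a - b)  # stationary probability of a sale last period
    return {x: (1 - m1) * Q_TRUE[(0, x)] + m1 * Q_TRUE[(1, x)] for x in ("L", "H")}


def posterior_after(n=200_000, sigma_h=SIGMA_H, seed=1):
    """Posterior (mean, variance) of theta_L and theta_H after n periods."""
    rng = random.Random(seed)
    s = 0
    # Beta(1, 1) priors stored as [sales + 1, no-sales + 1] for each price.
    counts = {"L": [1.0, 1.0], "H": [1.0, 1.0]}
    for _ in range(n):
        x = "H" if rng.random() < sigma_h else "L"
        s_next = 1 if rng.random() < Q_TRUE[(s, x)] else 0
        counts[x][0 if s_next else 1] += 1
        s = s_next
    summary = {}
    for x, (alpha, beta) in counts.items():
        total = alpha + beta
        summary[x] = (alpha / total, alpha * beta / (total ** 2 * (total + 1)))
    return summary


if __name__ == "__main__":
    print(posterior_after(), predicted_beliefs())
```

The posterior means settle near the wKLD minimizers rather than near any of the four true conditional sale probabilities, and the posterior variances shrink toward zero.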

The following result provides a learning foundation for the notion of Berk-Nash equilibrium of an SMDP.

Theorem 2. There exists $\bar{\delta} \in [0, 1]$ such that:

(i) for all $\delta \le \bar{\delta}$, if $(\sigma, m)$ is stable for a regular Bayesian-SMDP with full-support prior $\mu_0$ and policy function $f$ that is optimal, then $(\sigma, m)$ is a Berk-Nash equilibrium of the SMDP.

(ii) for all $\delta > \bar{\delta}$, if $(\sigma, m)$ is stable with exhaustive learning for a regular Bayesian-SMDP with full-support prior $\mu_0$ and policy function $f$ that is optimal, then $(\sigma, m)$ is a Berk-Nash equilibrium of the SMDP.

Proof. See the Appendix. □

Theorem 2 provides a learning justification for Berk-Nash equilibrium. The main idea behind the proof is as follows. We can always find a subsequence of posteriors that converges to some $\mu^*$ and, by Lemma 5 and the fact that behavior converges to $\sigma$, it follows that $\sigma$ must solve the dynamic optimization problem for beliefs converging to $\mu^* \in \Delta(\Theta_Q(m))$. In addition, by convergence of $\sigma_t$ to $\sigma$ and continuity of the transition kernel $\sigma \mapsto M_{\sigma,Q}$, an application of the martingale convergence theorem implies that $m_t$ is asymptotically equal to $M_{\sigma,Q}[m_t]$. This fact, linearity of the operator $M_{\sigma,Q}[\cdot]$, and convergence of $m_t$ to $m$ then imply that $m$ is an invariant distribution given $\sigma$.

The proof concludes by showing that $\sigma$ not only solves the optimization problem for beliefs converging to $\mu^*$ but also solves the MDP, where the belief is forever fixed at $\mu^*$. This is true, of course, if the agent is sufficiently impatient, which explains why part (i) of Theorem 2 holds. For sufficiently patient agents, the result relies on the assumption that the steady state satisfies exhaustive learning. We now illustrate and discuss the role of this assumption.

EXAMPLE. At the initial period, a risk-neutral agent has four investment choices: A, B, S, and O. Action A pays $1 - \theta^*$, action B pays $\theta^*$, and action S pays a safe payoff of 2/3 in the initial period, where $\theta^* \in \{0, 1\}$. For any of these three choices, the decision problem ends there and the agent makes a payoff of zero in all future periods. Action O gives the agent a payoff of $-1/3$ in the initial period and the option to make an investment next period, where there are two possible states, $s_A$ and $s_B$. State $s_A$ is realized if $\theta^* = 1$ and state $s_B$ is realized if $\theta^* = 0$. In each of these states, the agent can choose to make a risky investment or a safe investment. The safe investment gives a payoff of 2/3 in both states, and a subsequent payoff of zero in all future periods. The risky investment gives the agent a payoff that is thrice the payoff she would have gotten from choice A, that is, $3(1 - \theta^*)$, if the state is $s_A$, and it gives the agent thrice the payoff she would have gotten from choice B, that is, $3\theta^*$, if the state is $s_B$; the payoff is zero in all future periods.

Suppose that the agent knows all the primitives except the value of $\theta^*$. Let $\Theta = \{0, 1\}$; in particular, the SMDP is correctly specified. We now show that, in any Berk-Nash equilibrium, a sufficiently patient agent never chooses the safe action S: Let $\mu \in [0, 1]$ denote the agent's equilibrium belief about the probability that $\theta^* = 1$. For action S to be preferred to A and B, it must be the case that $\mu \in [1/3, 2/3]$. But, for a fixed $\mu$, the perceived benefit from action O is

\[ -\frac{1}{3} + \delta \left( \mu V_{\bar{Q}_\mu}(s_A) + (1 - \mu) V_{\bar{Q}_\mu}(s_B) \right) \ge -\frac{1}{3} + \delta \, 6\mu(1 - \mu), \]

which is strictly higher than 2/3, the payoff from action S, for all $\mu \in [1/3, 2/3]$ provided that $\delta > \bar{\delta} = 3/4$. Thus, for a sufficiently patient agent, there is no belief that makes action S optimal and, therefore, S is not chosen in any Berk-Nash equilibrium.

Now consider a Bayesian agent who starts with a prior $\mu_0 = \Pr(\theta = 1) \in (0, 1)$ and updates her belief. The value of action O is

\[ -\frac{1}{3} + \delta \left( \mu_0 W(s_A, 1) + (1 - \mu_0) W(s_B, 0) \right) = -\frac{1}{3} + \delta \frac{2}{3} < \frac{2}{3}, \]

because $W(s_A, 1) = W(s_B, 0) = 2/3$. In other words, the agent realizes that if the state $s_A$ is realized, then she will update her belief to $\mu = 1$, which implies that the safe investment is optimal in state $s_A$; a similar argument holds for state $s_B$. She then finds it optimal to choose action A if $\mu_0 < 1/3$, B if $\mu_0 > 2/3$, and S if $\mu_0 \in [1/3, 2/3]$. In particular, choosing S is a steady state outcome for some priors, although it is not chosen in any Berk-Nash equilibrium if the agent is sufficiently patient. The belief supporting S, however, does not satisfy exhaustive learning, since the agent believes that any other action would completely reveal all uncertainty. □
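The payoff comparisons in the example are simple enough to tabulate. The sketch below only restates the example's arithmetic in code; the function names and the grid of beliefs are choices made here.

```python
SAFE = 2 / 3  # payoff of the safe choice S and of the safe investment


def fixed_belief_value_of_O(mu, delta):
    """Perceived value of O when the belief mu = Pr(theta* = 1) is held fixed."""
    v_a = max(SAFE, 3 * (1 - mu))  # state s_A: safe vs. risky under belief mu
    v_b = max(SAFE, 3 * mu)        # state s_B: safe vs. risky under belief mu
    return -1 / 3 + delta * (mu * v_a + (1 - mu) * v_b)


def bayesian_value_of_O(delta):
    """Value of O for a Bayesian agent, who learns theta* from the realized state.

    After s_A she knows theta* = 1 and after s_B she knows theta* = 0; in both
    cases the risky investment pays zero, so she takes the safe payoff.
    """
    return -1 / 3 + delta * SAFE


if __name__ == "__main__":
    grid = [1 / 3 + k / 300 for k in range(101)]  # beliefs in [1/3, 2/3]
    for delta in (0.7, 0.9):
        worst = min(fixed_belief_value_of_O(mu, delta) for mu in grid)
        print(delta, worst, bayesian_value_of_O(delta), SAFE)
```

For a discount factor above 3/4 the fixed-belief value of O exceeds 2/3 on the whole grid, while the Bayesian value of O stays below 2/3 for every discount factor in $[0, 1)$.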

More generally, the failure of a steady state to be a Berk-Nash equilibrium if the agent is sufficiently patient occurs because the value of experimentation can be negative. To see this point, let the value of experimentation for action $x$ at state $s$ when the agent's belief is $\mu$ be

\[ ValueExp(s, x; \mu) = E_{\bar{Q}_\mu(\cdot \mid s, x)} \left[ W(S', B(s, x, S', \mu)) \right] - E_{\bar{Q}_\mu(\cdot \mid s, x)} \left[ V_{\bar{Q}_\mu}(S') \right]. \]

This expression is the difference between the value when the agent updates her prior $\mu$ and the value when the agent has a fixed belief $\mu$. An agent who does not account for future changes in beliefs may end up choosing an action with a negative value of experimentation that is actually suboptimal when accounting for changes in beliefs. In the previous example, the value of experimentation for action O given $\mu$ is

\[ \left( \mu W(s_A, 1) + (1 - \mu) W(s_B, 0) \right) - \left( \mu V_{\bar{Q}_\mu}(s_A) + (1 - \mu) V_{\bar{Q}_\mu}(s_B) \right), \]

which reduces to $2/3 - 6\mu(1 - \mu)$ and is negative for the values of $\mu$ that make S better than A and B. Thus, it is possible for action O to be optimal if the agent does not account for changes in beliefs, but suboptimal if she does.
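The reduction can be verified step by step for beliefs in $[1/3, 2/3]$. The block below is a worked check that uses only the payoffs stated in the example.

```latex
% For \mu \in [1/3, 2/3] the risky investment is perceived as better in both states.
\begin{align*}
V_{\bar{Q}_\mu}(s_A) &= \max\{2/3,\; 3(1-\mu)\} = 3(1-\mu), \qquad
V_{\bar{Q}_\mu}(s_B) = \max\{2/3,\; 3\mu\} = 3\mu, \\
W(s_A, 1) &= W(s_B, 0) = 2/3, \\
ValueExp(\cdot, O; \mu)
  &= \tfrac{2}{3} - \bigl(\mu \cdot 3(1-\mu) + (1-\mu)\cdot 3\mu\bigr)
   = \tfrac{2}{3} - 6\mu(1-\mu) \\
  &\le \tfrac{2}{3} - 6\cdot\tfrac{2}{9} = -\tfrac{2}{3} < 0,
\end{align*}
% since \mu(1-\mu) \ge 2/9 on [1/3, 2/3], with equality at the endpoints.
```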

We now discuss specifically how the property of exhaustive learning is used in the 
proof of Theorem 2. We call an action a steady-state action if it is in the support 
of a stable strategy and we call it a non steady-state action otherwise. A key step is 
to show that, if a steady-state action is better than a non steady-state action when 
beliefs are updated, it will also be better when beliefs are fixed. This is true provided 
that there is zero value of experimenting in steady state, which is guaranteed by 
exhaustive learning. If instead of exhaustive learning we were to simply require weak 
identification, there would be no value of experimentation for steady-state actions. 
The concern, illustrated by the previous example, is that the value of experimentation 
can be negative for a non steady-state action. Therefore, a non steady-state action 
could be suboptimal in the problem where the belief is updated but optimal in the 
problem where the belief is not updated (and so the negative value of experimentation 
is not taken into account). As shown by Esponda and Pouzo (2016), this concern does 
not arise in static settings, where the only state variable is a belief. The reason is that 
the convexity of the value function and the martingale property of Bayesian beliefs 
imply that the value of experimentation is always nonnegative. 
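
The static argument can be checked numerically. The following is a minimal sketch, not taken from the paper: the payoff table `PAYOFF` and the signal probabilities `SIGNAL` are hypothetical primitives chosen only for illustration. With a fixed-belief value $W(\mu) = \max_x E_\mu[\text{payoff}]$, which is convex in $\mu$, and posteriors that average back to the prior, Jensen's inequality gives $E[W(\text{posterior})] \geq W(\mu)$.

```python
# Hypothetical static problem: two parameter values (theta = 0 or 1),
# two actions, and one binary signal observed before acting.
PAYOFF = {"a": (1.0, 0.0), "b": (0.0, 1.0)}  # PAYOFF[x] = (payoff if theta=0, payoff if theta=1)
SIGNAL = (0.8, 0.3)                          # P(signal = 1 | theta = 0), P(signal = 1 | theta = 1)


def value(mu):
    """W(mu): best expected payoff when P(theta = 0) = mu and the belief is fixed."""
    return max(mu * p0 + (1.0 - mu) * p1 for p0, p1 in PAYOFF.values())


def value_of_experimentation(mu):
    """E[W(posterior)] - W(mu) after one observation of the signal."""
    p_one = mu * SIGNAL[0] + (1.0 - mu) * SIGNAL[1]       # P(signal = 1)
    post_one = mu * SIGNAL[0] / p_one                      # P(theta = 0 | signal = 1)
    post_zero = mu * (1.0 - SIGNAL[0]) / (1.0 - p_one)     # P(theta = 0 | signal = 0)
    expected = p_one * value(post_one) + (1.0 - p_one) * value(post_zero)
    return expected - value(mu)


# The difference is nonnegative at every interior belief on the grid.
GRID = [k / 10 for k in range(1, 10)]
GAPS = [value_of_experimentation(mu) for mu in GRID]
```

In a dynamic problem the next state enters the value function alongside the belief, so this convexity argument does not apply directly, which is why the value of experimentation can be negative there.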

We conclude with additional remarks about Theorem 2. 

Remark 3. Discount factor: In the proof of Theorem 2, we provide an exact value for
$\bar{\delta}$ as a function of primitives. This bound, however, may not be sharp. As illustrated
by the above example, to compute a sharp bound we would have to solve the dynamic 
optimization problem with learning, which is precisely what we are trying to avoid 
by focusing on Berk-Nash equilibrium. 

Convergence: Theorem 2 does not imply that behavior will necessarily stabilize in
an SMDP. In fact, it is well known from the theory of Markov chains that, even if no
decisions affect the relevant transitions, outcomes need not stabilize without further
assumptions. So one cannot hope to have general statements regarding convergence
of outcomes; this is also true, for example, in the related context of learning to
play Nash equilibrium in games. 27 Thus, the theorem leaves open the question of
convergence in specific settings, a question that requires other tools (e.g., stochastic 
approximation) and is best tackled by explicitly studying the dynamics of specific 
classes of environments (see the references in the introduction). 

Mixed strategies: Theorem 2 also raises the question of how a mixed strategy
could ever become stable, given that, in general, it is unlikely that agents will hold
beliefs that make them exactly indifferent at any point in time. Fudenberg and Kreps
(1993) asked the same question in the context of learning to play mixed strategy
Nash equilibria, and answered it by adding small payoff perturbations à la Harsanyi
(1973): Agents do not actually mix; instead, every period their payoffs are subject
to small perturbations, and what we call the mixed strategy is simply the probability
distribution generated by playing pure strategies and integrating over the payoff
perturbations. We followed this approach in the paper that introduced Berk-Nash
equilibrium in static contexts (Esponda and Pouzo, 2016). The same idea applies
here, but we omit payoff perturbations to reduce the notational burden. 28

6 Equilibrium refinements 

Theorem 2 implies that, for sufficiently patient players, we should be interested in 
the following refinement of Berk-Nash equilibrium. 

Definition 15. A strategy and probability distribution $(\sigma, m) \in \Sigma \times \Delta(\mathrm{Gr}(\Gamma))$ is
a Berk-Nash equilibrium with exhaustive learning of the SMDP if it is a
Berk-Nash equilibrium that is supported by a belief $\mu^* \in \Delta(\Theta)$ such that, for all
$(s,x) \in \mathrm{Gr}(\Gamma)$,

$$\mu^* = B(s, x, s', \mu^*)$$

for all $s' \in \mathbb{S}$ such that $\bar{Q}_{\mu^*}(s' \mid s, x) > 0$.

In an equilibrium with exhaustive learning, there is a supporting belief that is 
perceived to be a fixed point of the Bayesian operator, implying that no matter what 
state and strategy the agent contemplates, she does not expect her belief to change. 
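
The fixed-point condition can be illustrated with a finite parameter set. This is a sketch under stated assumptions, not the paper's construction: the function name `bayes_update` and the list-based representation of beliefs are hypothetical, and the likelihood vector stands in for $Q_\theta(s' \mid s, x)$ evaluated at each parameter value.

```python
def bayes_update(mu, likelihood):
    """Posterior B(s, x, s', mu) over a finite parameter set.

    mu[i] is the prior weight on theta_i and likelihood[i] plays the role of
    Q_{theta_i}(s' | s, x) for the observed transition (s, x, s').
    """
    total = sum(p * l for p, l in zip(mu, likelihood))
    if total <= 0.0:
        raise ValueError("the observed transition has zero probability under mu")
    return [p * l / total for p, l in zip(mu, likelihood)]


# Parameters in the support of mu agree on the transition: the belief is a fixed point.
PRIOR = [0.5, 0.5, 0.0]
UNCHANGED = bayes_update(PRIOR, [0.3, 0.3, 0.9])

# Parameters in the support disagree: the same observation moves the belief.
MOVED = bayes_update(PRIOR, [0.3, 0.6, 0.9])
```

In the first case every parameter with positive prior weight assigns the same probability to the observation, so the posterior equals the prior; this is the sense in which the agent does not expect her belief to change.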

27 For example, in the game-theory literature, general global convergence results have only been
obtained in special classes of games, e.g., zero-sum, potential, and supermodular games (Hofbauer
and Sandholm, 2002).

28 Doraszelski and Escobar (2010) incorporate payoff perturbations in a dynamic environment.

The requirement of exhaustive learning does not imply robustness to experimentation.
For example, in the monopoly problem studied in Section 4.1, choosing low price with
probability 1 is an equilibrium with exhaustive learning which is supported by the
belief that, with probability 1, $\theta_L^* = 0$. We rule out equilibria that are not robust to
experimentation by introducing a further refinement.

Definition 16. An $\varepsilon$-perturbed SMDP is an SMDP wherein strategies are restricted
to belong to

$$\Sigma^\varepsilon = \{\sigma \in \Sigma : \sigma(x \mid s) \geq \varepsilon \text{ for all } (s,x) \in \mathrm{Gr}(\Gamma)\}.$$

Definition 17. A strategy and probability distribution $(\sigma, m) \in \Sigma \times \Delta(\mathrm{Gr}(\Gamma))$ is a
perfect Berk-Nash equilibrium of an SMDP if there exists a sequence $(\sigma^\varepsilon, m^\varepsilon)_{\varepsilon > 0}$
of Berk-Nash equilibria with exhaustive learning of the $\varepsilon$-perturbed SMDP that
converges to $(\sigma, m)$ as $\varepsilon \to 0$. 29

Selten (1975) introduced the idea of perfection in extensive-form games. By itself, 
however, perfection does not guarantee that all $(s, x) \in \mathrm{Gr}(\Gamma)$ are reached in an MDP.
The next property guarantees that all states can be reached when the agent chooses 
all strategies with positive probability. 

Definition 18. An MDP($Q$) satisfies full communication if, for all $s_0, s' \in \mathbb{S}$, there
exist finite sequences $(s_1, \ldots, s_n)$ and $(x_0, x_1, \ldots, x_n)$ such that $(s_i, x_i) \in \mathrm{Gr}(\Gamma)$ for all
$i = 0, 1, \ldots, n$ and

$$Q(s' \mid s_n, x_n)\, Q(s_n \mid s_{n-1}, x_{n-1}) \cdots Q(s_1 \mid s_0, x_0) > 0.$$

An SMDP satisfies full communication if the corresponding MDP satisfies it. 
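
For a finite MDP, Definition 18 amounts to a reachability check on the directed graph that links $s$ to $s'$ whenever some feasible action moves from $s$ to $s'$ with positive probability. The sketch below is one way to run that check; the function name and the dictionary layout (`feasible[s]` for $\Gamma(s)$, `Q[(s, x)]` for $Q(\cdot \mid s, x)$) are assumptions made for illustration.

```python
def satisfies_full_communication(states, feasible, Q):
    """Check that every state reaches every state (itself included) in one or more steps.

    feasible[s] lists the actions in Gamma(s); Q[(s, x)] maps next states to probabilities.
    """
    # One-step successors of s under at least one feasible action.
    successors = {
        s: {s2 for x in feasible[s] for s2, p in Q[(s, x)].items() if p > 0.0}
        for s in states
    }
    for start in states:
        reached, frontier = set(), list(successors[start])
        while frontier:
            s = frontier.pop()
            if s not in reached:
                reached.add(s)
                frontier.extend(successors[s])
        if reached != set(states):
            return False
    return True


# Hypothetical two-state example: state 1 is absorbing, so state 0 is unreachable from it.
STATES = [0, 1]
FEASIBLE = {0: ["stay", "move"], 1: ["stay"]}
Q_ABSORBING = {(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0}, (1, "stay"): {1: 1.0}}
# Letting state 1 return to state 0 with positive probability restores full communication.
Q_COMMUNICATING = {(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0}, (1, "stay"): {0: 0.5, 1: 0.5}}
```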

Full communication is standard in the theory of MDPs and holds in all of the
examples in Section 4. It guarantees that there is a single recurrent class of states
for all $\varepsilon$-perturbed environments. In cases where it does not hold and there is more
than one recurrent class of states, one can still apply the following results by focusing
on one of the recurrent classes and ignoring the rest as long as the agent correctly
believes that she cannot go from one recurrent class to the other.

29 Formally, in order to have a sequence, we take $\varepsilon > 0$ to belong to the rational numbers;
hereinafter we leave this implicit to ease the notational burden.

Full communication guarantees that there are no off-equilibrium outcomes in a 
perturbed SMDP. It does not, however, rule out the desire for experimentation on 
the equilibrium path. We rule out the latter by requiring weak identification. 

Proposition 2. Suppose that an SMDP is weakly identified, $\varepsilon$-perturbed, and satisfies
full communication.

(i) If the SMDP is regular and if $(\sigma, m)$ is stable for the Bayesian-SMDP, it is
also stable with exhaustive learning.

(ii) If $(\sigma, m)$ is a Berk-Nash equilibrium, it is also a Berk-Nash equilibrium with
exhaustive learning.

Proof. See the Appendix. □ 

Proposition 2 provides conditions such that a steady state satisfies exhaustive 
learning and a Berk-Nash equilibrium can be supported by a belief that satisfies 
the exhaustive learning condition. Under these conditions, we can find equilibria 
that are robust to experimentation, i.e., perfect equilibria, by considering perturbed 
environments and taking the perturbations to zero (see the examples in Section 4). 

The next proposition shows that perfect Berk-Nash is a refinement of Berk-Nash 
with exhaustive learning. As illustrated by the monopoly example in Section 4.1, it 
is a strict refinement. 

Proposition 3. Any perfect Berk-Nash equilibrium of a regular SMDP is a Berk- 
Nash equilibrium with exhaustive learning. 

Proof. See the Appendix. □ 

We conclude by showing existence of perfect Berk-Nash equilibrium (hence, of 
Berk-Nash equilibrium with exhaustive learning, by Proposition 3). 

Theorem 3. For any regular SMDP that is weakly identified and satisfies full
communication, there exists a perfect Berk-Nash equilibrium.

Proof. See the Appendix. □

7 Conclusion


We studied Markov decision processes where the agent has a prior over a set of 
possible transition probability functions and updates her beliefs using Bayes’ rule. 
This problem is relevant in many economic settings but usually not amenable to 
analysis. We propose to make it more tractable by studying asymptotic beliefs and 
behavior. The answer to the question “Can the steady state of a Bayesian-SMDP be 
characterized by reference to an MDP with fixed beliefs?” is a qualified yes. If the 
agent is sufficiently impatient, it suffices to focus on the set of Berk-Nash equilibria. 
If, on the other hand, the agent is sufficiently patient and we are interested in steady 
states with exhaustive learning, then these steady states are characterized by the 
notion of Berk-Nash equilibrium with exhaustive learning. Finally, if we are interested
in equilibria that are robust to experimentation, we can restrict attention to the set
of perfect Berk-Nash equilibria.

Our results hold for both the correctly-specified and misspecified cases, and we
are not aware of any prior general results for either of these cases. For the correctly-
specified case, our results can justify the common assumption in the literature that
the agent knows the transition probability function provided that strong identification
holds (or that there is weak identification and one is interested in equilibria that are
robust to experimentation). In the misspecified case, our results significantly expand
the range of possible applications.

Appendix

Proof of Lemma 2. $I_Q(\sigma)$ is nonempty: $M_{\sigma,Q}$ is a linear (hence continuous) self-map
on a convex and compact subset of a Euclidean space (the set of probability
distributions over the finite set $\mathrm{Gr}(\Gamma)$); hence, Brouwer's theorem implies existence of
a fixed point.

$I_Q(\sigma)$ is convex valued: For all $\alpha \in [0,1]$ and $m_1, m_2 \in \Delta(\mathrm{Gr}(\Gamma))$,
$\alpha M_{\sigma,Q}[m_1] + (1-\alpha) M_{\sigma,Q}[m_2] = M_{\sigma,Q}[\alpha m_1 + (1-\alpha) m_2]$. Thus, if $m_1 = M_{\sigma,Q}[m_1]$ and
$m_2 = M_{\sigma,Q}[m_2]$, then $\alpha m_1 + (1-\alpha) m_2 = M_{\sigma,Q}[\alpha m_1 + (1-\alpha) m_2]$.

$I_Q(\sigma)$ is upper hemicontinuous and compact valued: Fix any sequence $(\sigma_n, m_n)_n$
in $\Sigma \times \Delta(\mathrm{Gr}(\Gamma))$ such that $\lim_{n\to\infty}(\sigma_n, m_n) = (\sigma, m)$ and such that $m_n \in I_Q(\sigma_n)$ for
all $n$. Since $M_{\sigma_n,Q}[m_n] = m_n$,

$$\|m - M_{\sigma,Q}[m]\| \leq \|m - m_n\| + \|M_{\sigma_n,Q}[m_n - m]\| + \|M_{\sigma_n,Q}[m] - M_{\sigma,Q}[m]\|.$$

The first term in the RHS vanishes by the hypothesis. The second term satisfies
$\|M_{\sigma_n,Q}[m_n - m]\| \leq \|M_{\sigma_n,Q}\| \times \|m_n - m\|$ and also vanishes. 30
For the third term, note that $\sigma \mapsto M_{\sigma,Q}[m]$ is a linear mapping and
$\sup_\sigma \|M_{\sigma,Q}[m]\| \leq \max_{s'} \big|\sum_{(s,x)\in \mathrm{Gr}(\Gamma)} Q(s' \mid s,x)\, m(s,x)\big| < \infty$. Thus
$\|M_{\sigma_n,Q}[m] - M_{\sigma,Q}[m]\| \leq K \times \|\sigma_n - \sigma\|$ for some $K < \infty$, and so it also vanishes.
Therefore, $m = M_{\sigma,Q}[m]$; thus, $I_Q(\cdot)$ has a closed graph and so $I_Q(\sigma)$ is a closed set.
Compactness of $I_Q(\sigma)$ follows from compactness of $\Delta(\mathrm{Gr}(\Gamma))$. Therefore, $I_Q(\cdot)$ is upper
hemicontinuous (see Aliprantis and Border (2006), Theorem 17.11). □
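
An element of $I_Q(\sigma)$ can be computed numerically in small examples. The sketch below assumes the kernel takes the form $M_{\sigma,Q}(s', x' \mid s, x) = \sigma(x' \mid s')\, Q(s' \mid s, x)$, which is how it is used in the proof of Theorem 2; the function names, the dictionary layout, and the averaging step (added so that the iteration also settles down for periodic chains) are choices made here for illustration.

```python
def apply_M(m, sigma, Q):
    """One application of M_{sigma,Q} to a distribution m over Gr(Gamma)."""
    # Mass arriving at each next state s': sum over (s, x) of Q(s' | s, x) m(s, x).
    arrival = {}
    for (s, x), weight in m.items():
        for s_next, prob in Q[(s, x)].items():
            arrival[s_next] = arrival.get(s_next, 0.0) + prob * weight
    # Split the arriving mass across actions according to sigma(x' | s').
    return {(s2, x2): sigma[s2].get(x2, 0.0) * arrival.get(s2, 0.0) for (s2, x2) in m}


def stationary_distribution(sigma, Q, iters=2000):
    """Approximate a fixed point m = M_{sigma,Q}[m] by iterating a lazy version of the map."""
    pairs = list(Q)
    m = {pair: 1.0 / len(pairs) for pair in pairs}
    for _ in range(iters):
        nxt = apply_M(m, sigma, Q)
        m = {pair: 0.5 * m[pair] + 0.5 * nxt[pair] for pair in pairs}
    return m


# Hypothetical example: state 0 has actions "a" and "b", state 1 has action "a".
Q_EXAMPLE = {(0, "a"): {0: 0.5, 1: 0.5}, (0, "b"): {1: 1.0}, (1, "a"): {0: 1.0}}
SIGMA_EXAMPLE = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 1.0}}
M_STAR = stationary_distribution(SIGMA_EXAMPLE, Q_EXAMPLE)
```

Any fixed point of the averaged map is also a fixed point of $M_{\sigma,Q}$, so the averaging changes the speed of convergence but not the set of limits.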

The proof of Lemma 3 relies on the following claim. The proofs of Claims A, B, 
and C in this appendix appear in the Online Appendix . 


Claim A. (i) For any regular SMDP, there exists $\theta^* \in \Theta$ and $K < \infty$ such
that, for all $m \in \Delta(\mathrm{Gr}(\Gamma))$, $K_Q(m, \theta^*) \leq K$. (ii) Fix any $\theta \in \Theta$ and a sequence
$(m_n)_n$ in $\Delta(\mathrm{Gr}(\Gamma))$ such that $Q_\theta(s' \mid s,x) > 0$ for all $(s', s, x) \in \mathbb{S} \times \mathrm{Gr}(\Gamma)$ such that
$Q(s' \mid s, x) > 0$ and $\lim_{n\to\infty} m_n = m$. Then $\lim_{n\to\infty} K_Q(m_n, \theta) = K_Q(m, \theta)$. (iii) $K_Q$
is (jointly) lower semicontinuous: Fix any $(m_n)_n$ and $(\theta_n)_n$ such that $\lim_{n\to\infty} m_n = m$
and $\lim_{n\to\infty} \theta_n = \theta$. Then $\liminf_{n\to\infty} K_Q(m_n, \theta_n) \geq K_Q(m, \theta)$.


Proof of Lemma 3. (i) By Jensen's inequality and strict concavity of $\ln(\cdot)$,

$$K_Q(m, \theta) \geq -\sum_{(s,x)\in \mathrm{Gr}(\Gamma)} \ln\Big(E_{Q(\cdot \mid s,x)}\Big[\frac{Q_\theta(S' \mid s,x)}{Q(S' \mid s,x)}\Big]\Big)\, m(s,x) = 0,$$

with equality if and only if $Q_\theta(\cdot \mid s,x) = Q(\cdot \mid s,x)$ for all $(s,x)$ such that $m(s,x) > 0$.

(ii) $\Theta_Q(m)$ is nonempty: By Claim A(i), there exists $K < \infty$ such that the
minimizers are in the constraint set $\{\theta \in \Theta : K_Q(m, \theta) \leq K\}$. Because $K_Q(m, \cdot)$ is
continuous over a compact set, a minimum exists.

$\Theta_Q(\cdot)$ is uhc and compact valued: Fix any $(m_n)_n$ and $(\theta_n)_n$ such that $\lim_{n\to\infty} m_n = m$,
$\lim_{n\to\infty} \theta_n = \theta$, and $\theta_n \in \Theta_Q(m_n)$ for all $n$. We establish that $\theta \in \Theta_Q(m)$ (so
that $\Theta_Q(\cdot)$ has a closed graph and, by compactness of $\Theta$, it is uhc). Suppose, in
order to obtain a contradiction, that $\theta \notin \Theta_Q(m)$. Then, by Claim A(i), there exists
$\hat\theta \in \Theta$ and $\varepsilon > 0$ such that $K_Q(m, \hat\theta) \leq K_Q(m, \theta) - 3\varepsilon$ and $K_Q(m, \hat\theta) < \infty$. By
regularity, there exists $(\hat\theta_j)_j$ with $\lim_{j\to\infty} \hat\theta_j = \hat\theta$ and, for all $j$, $Q_{\hat\theta_j}(s' \mid s,x) > 0$
for all $(s', s, x) \in \mathbb{S}^2 \times \mathbb{X}$ such that $Q(s' \mid s,x) > 0$. We will show that there is
an element of the sequence, $\hat\theta_J$, that "does better" than $\theta_n$ given $m_n$, which is a
contradiction. Because $K_Q(m, \hat\theta) < \infty$, continuity of $K_Q(m, \cdot)$ implies that there
exists $J$ large enough such that $|K_Q(m, \hat\theta_J) - K_Q(m, \hat\theta)| \leq \varepsilon/2$. Moreover, Claim
A(ii) applied to $\theta = \hat\theta_J$ implies that there exists $N_{\varepsilon,J}$ such that, for all $n \geq N_{\varepsilon,J}$,
$|K_Q(m_n, \hat\theta_J) - K_Q(m, \hat\theta_J)| \leq \varepsilon/2$. Thus, for all $n \geq N_{\varepsilon,J}$,
$|K_Q(m_n, \hat\theta_J) - K_Q(m, \hat\theta)| \leq |K_Q(m_n, \hat\theta_J) - K_Q(m, \hat\theta_J)| + |K_Q(m, \hat\theta_J) - K_Q(m, \hat\theta)| \leq \varepsilon$
and, therefore,

$$K_Q(m_n, \hat\theta_J) \leq K_Q(m, \hat\theta) + \varepsilon \leq K_Q(m, \theta) - 2\varepsilon. \tag{15}$$

Suppose $K_Q(m, \theta) < \infty$. By Claim A(iii), there exists $n_\varepsilon \geq N_{\varepsilon,J}$ such that
$K_Q(m_{n_\varepsilon}, \theta_{n_\varepsilon}) \geq K_Q(m, \theta) - \varepsilon$. This result, together with (15), implies that
$K_Q(m_{n_\varepsilon}, \hat\theta_J) \leq K_Q(m_{n_\varepsilon}, \theta_{n_\varepsilon}) - \varepsilon$. But this contradicts $\theta_{n_\varepsilon} \in \Theta_Q(m_{n_\varepsilon})$. Finally, if
$K_Q(m, \theta) = \infty$, Claim A(iii) implies that there exists $n_\varepsilon \geq N_{\varepsilon,J}$ such that
$K_Q(m_{n_\varepsilon}, \theta_{n_\varepsilon}) \geq 2K$, where $K$ is the bound defined in Claim A(i). But this also
contradicts $\theta_{n_\varepsilon} \in \Theta_Q(m_{n_\varepsilon})$. Thus, $\Theta_Q(\cdot)$ has a closed graph, and so $\Theta_Q(m)$ is a closed set.
Compactness of $\Theta_Q(m)$ follows from compactness of $\Theta$. Therefore, $\Theta_Q(\cdot)$ is upper
hemicontinuous (see Aliprantis and Border (2006), Theorem 17.11). □

30 For a matrix $A$, $\|A\|$ is understood as the operator norm.
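
The objects in Lemma 3 are easy to compute when the parameter set is replaced by a finite grid. The sketch below evaluates the weighted Kullback-Leibler divergence $K_Q(m, \theta) = \sum_{(s,x)} m(s,x)\, E_{Q(\cdot \mid s,x)}[\ln(Q(S' \mid s,x)/Q_\theta(S' \mid s,x))]$ and returns its minimizers over the grid. The function names, the single state-action pair, and the Bernoulli family `bernoulli_model` are hypothetical and serve only as an illustration.

```python
import math


def weighted_kl(m, Q, Q_theta):
    """K_Q(m, theta) for finite state and action sets (math.inf if Q_theta misses the support of Q)."""
    total = 0.0
    for pair, weight in m.items():
        if weight == 0.0:
            continue
        for s_next, p in Q[pair].items():
            if p == 0.0:
                continue
            q = Q_theta[pair].get(s_next, 0.0)
            if q == 0.0:
                return math.inf
            total += weight * p * math.log(p / q)
    return total


def kl_minimizers(m, Q, model, thetas, tol=1e-12):
    """Grid version of Theta_Q(m): the thetas attaining the smallest K_Q(m, theta)."""
    values = {theta: weighted_kl(m, Q, model(theta)) for theta in thetas}
    best = min(values.values())
    return [theta for theta in thetas if values[theta] <= best + tol], best


def bernoulli_model(theta):
    """Hypothetical subjective model: next state is 1 with probability theta."""
    return {("s", "x"): {1: theta, 0: 1.0 - theta}}


TRUE_Q = {("s", "x"): {1: 0.5, 0: 0.5}}
M_POINT = {("s", "x"): 1.0}
# Misspecified grid: the true value 0.5 is not available, so the minimum is strictly positive.
MISSPECIFIED = kl_minimizers(M_POINT, TRUE_Q, bernoulli_model, [0.2, 0.45, 0.9])
# Correctly specified grid: the minimum is zero and is attained at the truth.
WELL_SPECIFIED = kl_minimizers(M_POINT, TRUE_Q, bernoulli_model, [0.2, 0.5, 0.9])
```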

Proof of Theorem 1. Let $\mathbb{W} = \Sigma \times \Delta(\mathrm{Gr}(\Gamma)) \times \Delta(\Theta)$ and endow it with
the product topology (given by the Euclidean one for $\Sigma \times \Delta(\mathrm{Gr}(\Gamma))$ and the weak
topology for $\Delta(\Theta)$). Clearly, $\mathbb{W} \neq \emptyset$. Since $\Theta$ is compact, $\Delta(\Theta)$ is compact under
the weak topology; $\Sigma$ and $\Delta(\mathrm{Gr}(\Gamma))$ are also compact. Thus by Tychonoff's theorem
(see Aliprantis and Border (2006)), $\mathbb{W}$ is compact under the product topology. $\mathbb{W}$ is
also convex. Finally, $\mathbb{W} \subseteq \mathbb{M}^2 \times rca(\Theta)$, where $\mathbb{M}$ is the space of $|\mathbb{S}| \times |\mathbb{X}|$ real-valued
matrices and $rca(\Theta)$ is the space of regular Borel signed measures endowed with the
weak topology. The space $\mathbb{M}^2 \times rca(\Theta)$ is locally convex with a family of seminorms
$\{(\sigma, m, \mu) \mapsto p_f(\sigma, m, \mu) = \|(\sigma, m)\| + |\int_\Theta f(x)\mu(dx)| : f \in C(\Theta)\}$ ($C(\Theta)$ is the space
of real-valued continuous and bounded functions and $\|\cdot\|$ is understood as the spectral
norm). Also, we observe that $(\sigma, m, \mu) = 0$ iff $p_f(\sigma, m, \mu) = 0$ for all $f \in C(\Theta)$; thus
$\mathbb{M}^2 \times rca(\Theta)$ is also Hausdorff.

Let $T : \mathbb{W} \to 2^{\mathbb{W}}$ be such that $T(\sigma, m, \mu) = \Sigma(\bar{Q}_\mu) \times I_Q(\sigma) \times \Delta(\Theta_Q(m))$. Note that
if $(\sigma^*, m^*, \mu^*)$ is a fixed point of $T$, then $m^*$ is a Berk-Nash equilibrium. By Lemma 1,
$\Sigma(\cdot)$ is nonempty, convex valued, compact valued, and upper hemicontinuous. Thus,
for every $\mu \in \Delta(\Theta)$, $\Sigma(\bar{Q}_\mu)$ is nonempty, convex valued, and compact valued. Also,
because $Q_\theta$ is continuous in $\theta$ (by the regularity assumption), $\bar{Q}_\mu$ is continuous (under
the weak topology) in $\mu$. Since $Q \mapsto \Sigma(Q)$ is upper hemicontinuous, $\Sigma(\bar{Q}_\mu)$ is
also upper hemicontinuous as a function of $\mu$. By Lemma 2, $I_Q(\cdot)$ is nonempty, convex
valued, compact valued and upper hemicontinuous. By Lemma 3 and the regularity
condition, the correspondence $\Theta_Q(\cdot)$ is nonempty, compact valued, and upper
hemicontinuous; hence, the correspondence $\Delta(\Theta_Q(\cdot))$ is nonempty, upper hemicontinuous
(see Aliprantis and Border (2006), Theorem 17.13), compact valued (see Aliprantis
and Border (2006), Theorem 15.11) and, trivially, convex valued. Thus, the
correspondence $T$ is nonempty, convex valued, compact valued (by Tychonoff's Theorem),
and upper hemicontinuous (see Aliprantis and Border (2006), Theorem 17.28) under
the product topology; hence, it has a closed graph (see Aliprantis and Border (2006),
Theorem 17.11). Since $\mathbb{W}$ is a nonempty compact convex subset of a locally convex
Hausdorff space, there exists a fixed point of $T$ by the Kakutani-Fan-Glicksberg theorem
(see Aliprantis and Border (2006), Corollary 17.55). □

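For the proof of Lemma 5, we rely on the following definitions and Claim. Let
$K^*(m) \equiv \inf_{\theta\in\Theta} K_Q(m, \theta)$ and let $\hat\Theta \subseteq \Theta$ be a dense set such that, for all $\theta \in \hat\Theta$,
$Q_\theta(s' \mid s,x) > 0$ for all $(s', s, x) \in \mathbb{S} \times \mathrm{Gr}(\Gamma)$ such that $Q(s' \mid s,x) > 0$. Existence of
such a set $\hat\Theta$ follows from the regularity assumption.

Claim B. Suppose $\lim_{t\to\infty} \|m_t - m\| = 0$ a.s.-$\mathbf{P}^f_{\mu_0}$. Then: (i) For all $\theta \in \hat\Theta$,

$$\lim_{t\to\infty} t^{-1} \sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} = \sum_{(s,x)\in \mathrm{Gr}(\Gamma)} E_{Q(\cdot \mid s,x)}\Big[\log \frac{Q(S' \mid s,x)}{Q_\theta(S' \mid s,x)}\Big]\, m(s,x)$$

a.s.-$\mathbf{P}^f_{\mu_0}$. (ii) For $\mathbf{P}^f_{\mu_0}$-almost all $h^\infty \in \mathbb{H}$ and for any $\varepsilon > 0$ and
$\alpha = (\inf_{\theta : d_m(\theta) \geq \varepsilon} K_Q(m, \theta) - K^*(m))/3$, there exists $T$ such that, for all $t \geq T$,

$$t^{-1} \sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \geq K^*(m) + \frac{3}{2}\alpha$$

for all $\theta \in \{\theta : d_m(\theta) \geq \varepsilon\}$, where $d_m(\theta) = \inf_{\tilde\theta \in \Theta_Q(m)} \|\theta - \tilde\theta\|$.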

Proof of Lemma 5. It suffices to show that $\lim_{t\to\infty} \int_\Theta d_m(\theta)\mu_t(d\theta) = 0$ a.s.-$\mathbf{P}^f_{\mu_0}$
over $\mathbb{H}$. Let $K^*(m) \equiv K_Q(m, \Theta_Q(m))$. For any $\eta > 0$ let $\Theta_\eta(m) = \{\theta \in \Theta : d_m(\theta) < \eta\}$,
and $\hat\Theta_\eta(m) = \hat\Theta \cap \Theta_\eta(m)$ (the set $\hat\Theta$ is defined in condition 3 of Definition 6,
i.e., regularity). We now show that $\mu_0(\hat\Theta_\eta(m)) > 0$. By Lemma 3, $\Theta_Q(m)$ is
nonempty. By denseness of $\hat\Theta$, $\hat\Theta_\eta(m)$ is nonempty. Nonemptiness and continuity
of $\theta \mapsto Q_\theta$ imply that there exists a non-empty open set $U \subseteq \hat\Theta_\eta(m)$. By full
support, $\mu_0(\hat\Theta_\eta(m)) > 0$. Also, observe that for any $\varepsilon > 0$, $\{\theta : d_m(\theta) \geq \varepsilon\}$ is
compact. This follows from compactness of $\Theta$ and continuity of $\theta \mapsto d_m(\theta)$ (which
follows by Lemma 3 and an application of the Theorem of the Maximum).
Compactness of $\{\theta : d_m(\theta) \geq \varepsilon\}$ and lower semi-continuity of $\theta \mapsto K_Q(m, \theta)$ (see Claim
A(iii)) imply that $\inf_{\theta : d_m(\theta) \geq \varepsilon} K_Q(m, \theta) = \min_{\theta : d_m(\theta) \geq \varepsilon} K_Q(m, \theta) > K^*(m)$. Let
$\alpha \equiv (\min_{\theta : d_m(\theta) \geq \varepsilon} K_Q(m, \theta) - K^*(m))/3 > 0$. Also, let $\eta > 0$ be chosen such that
$K_Q(m, \theta) < K^*(m) + 0.25\alpha$ for all $\theta \in \Theta_\eta(m)$ (such $\eta$ always exists by continuity of
$\theta \mapsto K_Q(m, \theta)$).

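Let $\mathbb{H}_1$ be the subset of $\mathbb{H}$ for which the statements in Claim B hold; note that
$\mathbf{P}^f_{\mu_0}(\mathbb{H} \setminus \mathbb{H}_1) = 0$. Henceforth, fix $h^\infty \in \mathbb{H}_1$; we omit $h^\infty$ from the notation to ease
the notational burden. By simple algebra and the fact that $d_m$ is bounded in $\Theta$, it
follows that, for all $\varepsilon > 0$ and some finite $C > 0$,

$$\int_\Theta d_m(\theta)\mu_t(d\theta) = \frac{\int_\Theta d_m(\theta) Q_\theta(s_t \mid s_{t-1}, x_{t-1})\mu_{t-1}(d\theta)}{\int_\Theta Q_\theta(s_t \mid s_{t-1}, x_{t-1})\mu_{t-1}(d\theta)} = \frac{\int_\Theta d_m(\theta) Z_t(\theta)\mu_0(d\theta)}{\int_\Theta Z_t(\theta)\mu_0(d\theta)}$$

$$\leq \varepsilon + C\,\frac{\int_{\{\theta : d_m(\theta) \geq \varepsilon\}} Z_t(\theta)\mu_0(d\theta)}{\int_{\hat\Theta_\eta(m)} Z_t(\theta)\mu_0(d\theta)} \equiv \varepsilon + C\,\frac{A_t(\varepsilon)}{B_t(\eta)},$$

where

$$Z_t(\theta) \equiv \prod_{\tau=1}^{t} \frac{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})} = \exp\Big\{-\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big\}.$$

Hence, it suffices to show that

$$\limsup_{t\to\infty} \{\exp\{t(K^*(m) + 0.5\alpha)\} A_t(\varepsilon)\} = 0 \tag{16}$$

and

$$\liminf_{t\to\infty} \{\exp\{t(K^*(m) + 0.5\alpha)\} B_t(\eta)\} = \infty. \tag{17}$$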

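Regarding equation (16), we first show that

$$\limsup_{t\to\infty} \sup_{\{\theta : d_m(\theta) \geq \varepsilon\}} \Big\{(K^*(m) + 0.5\alpha) - t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big\} \leq \text{const} < 0.$$

To show this, note that, by Claim B(ii), there exists a $T$ such that, for all $t \geq T$,
$t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \geq K^*(m) + \frac{3}{2}\alpha$ for all $\theta \in \{\theta : d_m(\theta) \geq \varepsilon\}$. Thus,

$$\limsup_{t\to\infty} \sup_{\{\theta : d_m(\theta) \geq \varepsilon\}} \Big\{K^*(m) + \frac{\alpha}{2} - t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big\} \leq -\alpha.$$

Therefore,

$$\limsup_{t\to\infty} \{\exp\{t(K^*(m) + 0.5\alpha)\} A_t(\varepsilon)\}$$

$$\leq \limsup_{t\to\infty} \sup_{\{\theta : d_m(\theta) \geq \varepsilon\}} \exp\Big\{t\Big((K^*(m) + 0.5\alpha) - t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big)\Big\}$$

$$= 0.$$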

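Regarding equation (17), by Fatou's lemma and some algebra it suffices to show
that

$$\liminf_{t\to\infty} \exp\{t(K^*(m) + 0.5\alpha)\} Z_t(\theta) = \infty > 0$$

(pointwise on $\theta \in \hat\Theta_\eta(m)$), or, equivalently,

$$\liminf_{t\to\infty} \Big(K^*(m) + 0.5\alpha - t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big) > 0.$$

By Claim B(i),

$$\liminf_{t\to\infty} \Big(K^*(m) + 0.5\alpha - t^{-1}\sum_{\tau=1}^{t} \log \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})}\Big) = K^*(m) + 0.5\alpha - K_Q(m, \theta)$$

(pointwise on $\theta \in \hat\Theta_\eta(m)$). By our choice of $\eta$, the RHS is greater than $0.25\alpha$ and our
desired result follows. □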
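
The concentration result can be seen in a small simulation. The sketch below is a deliberately simplified setting, not the general SMDP of the lemma: there is a single state-action pair, the outcome is binary, the parameter set is a three-point grid that excludes the truth, and the function name and seed are choices made here. The posterior is tracked in logs to avoid numerical underflow.

```python
import math
import random


def posterior_after(T, true_p, thetas, prior, seed=0):
    """Posterior over a finite grid after T binary outcomes drawn with P(outcome = 1) = true_p."""
    rng = random.Random(seed)
    log_post = [math.log(w) for w in prior]
    for _ in range(T):
        outcome = 1 if rng.random() < true_p else 0
        for i, theta in enumerate(thetas):
            # Add the log-likelihood of the outcome under parameter theta.
            log_post[i] += math.log(theta if outcome == 1 else 1.0 - theta)
    # Normalize in a numerically stable way.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# True probability 0.5 is not on the grid; 0.45 is the closest parameter in the
# Kullback-Leibler sense, so the posterior should pile up on it.
THETAS = [0.2, 0.45, 0.9]
POSTERIOR = posterior_after(5000, 0.5, THETAS, [1.0 / 3.0] * 3)
```

The belief does not converge to the truth, which is unavailable, but to the parameter that minimizes the divergence from the true outcome distribution.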

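Proof of Theorem 2. For any $s \in \mathbb{S}$ and $\mu \in \Delta(\Theta)$, let

$$x(s, \mu) \equiv \arg\max_{x \in \Gamma(s)} E_{\bar{Q}_\mu(\cdot \mid s,x)}[\pi(s, x, S')]$$

$$\Delta(s, \mu) \equiv \min_{z \in \Gamma(s) \setminus x(s,\mu)} \Big\{\max_{x \in \Gamma(s)} E_{\bar{Q}_\mu(\cdot \mid s,x)}[\pi(s, x, S')] - E_{\bar{Q}_\mu(\cdot \mid s,z)}[\pi(s, z, S')]\Big\}$$

$$\Delta \equiv \max\Big\{\min_{s,\mu} \Delta(s, \mu),\, 0\Big\}$$

$$\bar\delta \equiv \max\Big\{\delta \geq 0 \;\Big|\; \Delta - 2\frac{\delta}{1-\delta}M \geq 0\Big\} = \frac{\Delta/M}{2 + \Delta/M},$$

where $M \equiv \max_{(s,x)\in \mathrm{Gr}(\Gamma),\, s' \in \mathbb{S}} |\pi(s, x, s')|$.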
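
The closed form for $\bar\delta$ follows from rearranging the constraint. As a short check, assuming $M > 0$ and $\delta \in [0, 1)$:

```latex
\begin{align*}
\Delta - 2\frac{\delta}{1-\delta}M \geq 0
  &\iff \Delta(1-\delta) \geq 2\delta M \\
  &\iff \Delta \geq \delta\,(2M + \Delta) \\
  &\iff \delta \leq \frac{\Delta}{2M + \Delta} = \frac{\Delta/M}{2 + \Delta/M}.
\end{align*}
```

In particular, $\bar\delta = 0$ when $\Delta = 0$, and $\bar\delta < 1$ for any finite $\Delta$.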

By Lemma 5, for all open sets $U \supseteq \Theta_Q(m)$, $\lim_{t\to\infty} \mu_t(U) = 1$ a.s.-$\mathbf{P}^f_{\mu_0}$ in $\mathbb{H}$.
Also, let $g_\tau(h^\infty)(s,x) = 1_{(s,x)}(s_\tau, x_\tau) - M_{\sigma_\tau,Q}(s,x \mid s_{\tau-1}, x_{\tau-1})$ for any $\tau$ and $(s,x) \in \mathrm{Gr}(\Gamma)$
and $h^\infty \in \mathbb{H}$. The sequence $(g_\tau)_\tau$ is a martingale difference and, by analogous
arguments to those in the proof of Claim B, $\lim_{t\to\infty} \|t^{-1}\sum_{\tau=0}^{t} g_\tau(h^\infty)\| = 0$ a.s.-$\mathbf{P}^f_{\mu_0}$.

Let $\mathbb{H}^*$ be the set in $\mathbb{H}$ such that for all $h^\infty \in \mathbb{H}^*$ the following holds: for all open
sets $U \supseteq \Theta_Q(m)$, $\lim_{t\to\infty} \mu_t(U) = 1$ and $\lim_{t\to\infty} \|t^{-1}\sum_{\tau=0}^{t} g_\tau(h^\infty)\| = 0$. Note that
$\mathbf{P}^f_{\mu_0}(\mathbb{H} \setminus \mathbb{H}^*) = 0$. Henceforth, fix an $h^\infty \in \mathbb{H}^*$, which we omit from the notation.

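We first establish that $m \in I_Q(\sigma)$. Note that

$$\|m - M_{\sigma,Q}[m]\| \leq \|m - m_t\| + \|m_t - M_{\sigma,Q}[m]\|,$$

where $(s,x) \mapsto M_{\sigma,Q}[p](s,x) \equiv \sum_{(\tilde s, \tilde x)\in \mathrm{Gr}(\Gamma)} M_{\sigma,Q}(s,x \mid \tilde s, \tilde x)\, p(\tilde s, \tilde x)$ for any $p \in \Delta(\mathrm{Gr}(\Gamma))$.

By stability, the first term in the RHS vanishes, so it suffices to show that
$\lim_{t\to\infty} \|m_t - M_{\sigma,Q}[m]\| = 0$. The fact that $\lim_{t\to\infty} \|t^{-1}\sum_{\tau=0}^{t} g_\tau\| = 0$ and the triangle inequality
imply

$$\lim_{t\to\infty} \|m_t - M_{\sigma,Q}[m]\| \leq \lim_{t\to\infty} \Big\|m_t - t^{-1}\sum_{\tau=1}^{t} M_{\sigma_\tau,Q}(\cdot,\cdot \mid s_{\tau-1}, x_{\tau-1})\Big\| + \lim_{t\to\infty} \Big\|t^{-1}\sum_{\tau=1}^{t} M_{\sigma_\tau,Q}(\cdot,\cdot \mid s_{\tau-1}, x_{\tau-1}) - M_{\sigma,Q}[m]\Big\|$$

$$= \lim_{t\to\infty} \Big\|t^{-1}\sum_{\tau=1}^{t} g_\tau\Big\| + \lim_{t\to\infty} \Big\|t^{-1}\sum_{\tau=1}^{t} M_{\sigma_\tau,Q}(\cdot,\cdot \mid s_{\tau-1}, x_{\tau-1}) - M_{\sigma,Q}[m]\Big\|$$

$$\leq \lim_{t\to\infty} \Big\|t^{-1}\sum_{\tau=1}^{t} M_{\sigma_\tau,Q}(\cdot,\cdot \mid s_{\tau-1}, x_{\tau-1}) - M_{\sigma,Q}\Big[t^{-1}\sum_{\tau=1}^{t} 1_{(\cdot,\cdot)}(s_{\tau-1}, x_{\tau-1})\Big]\Big\| + \lim_{t\to\infty} \Big\|M_{\sigma,Q}\Big[t^{-1}\sum_{\tau=1}^{t} 1_{(\cdot,\cdot)}(s_{\tau-1}, x_{\tau-1})\Big] - M_{\sigma,Q}[m]\Big\|. \tag{18}$$

Moreover, by definition of $M_{\sigma,Q}$ (see equation (4)), for all $(s,x) \in \mathrm{Gr}(\Gamma)$,

$$t^{-1}\sum_{\tau=1}^{t} M_{\sigma_\tau,Q}(s,x \mid s_{\tau-1}, x_{\tau-1}) = \sum_{(\tilde s,\tilde x)\in \mathrm{Gr}(\Gamma)} Q(s \mid \tilde s, \tilde x)\, t^{-1}\sum_{\tau=1}^{t} \sigma_\tau(x \mid s)\, 1_{(\tilde s,\tilde x)}(s_{\tau-1}, x_{\tau-1}) \tag{19}$$

and

$$M_{\sigma,Q}\Big[t^{-1}\sum_{\tau=1}^{t} 1_{(\cdot,\cdot)}(s_{\tau-1}, x_{\tau-1})\Big](s,x) = \sum_{(\tilde s,\tilde x)\in \mathrm{Gr}(\Gamma)} Q(s \mid \tilde s, \tilde x)\, t^{-1}\sum_{\tau=1}^{t} \sigma(x \mid s)\, 1_{(\tilde s,\tilde x)}(s_{\tau-1}, x_{\tau-1}). \tag{20}$$

Equations (19) and (20) and stability ($\sigma_t \to \sigma$) imply that the first term in the
RHS of (18) vanishes. The second term in the RHS also vanishes under stability due
to continuity of the operator $M_{\sigma,Q}[\cdot]$ and the fact that $t^{-1}\sum_{\tau=1}^{t} 1_{(\cdot,\cdot)}(s_{\tau-1}, x_{\tau-1}) = \frac{t-1}{t} m_{t-1}(\cdot,\cdot)$.
Thus, $\|m - M_{\sigma,Q}[m]\| = 0$, and so $m \in I_Q(\sigma)$.

Therefore, for proving cases (i) and (ii), we need to establish that, for each case,
there exists $\mu \in \Delta(\Theta_Q(m))$ such that $\sigma$ is an optimal strategy for the MDP($\bar{Q}_\mu$).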

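(i) Consider any $\delta \in [0, \bar\delta]$. Since $\Delta(\Theta)$ is compact under the weak topology, there
exists a subsequence of $(\mu_t)_t$, which we still denote as $(\mu_t)_t$, such that $\mu_t$ converges
weakly to $\mu_\infty$ and $\mu_\infty \in \Delta(\Theta_Q(m))$. Since $\sigma_t \in \Sigma(\mu_t)$ for all $t$ and $\Sigma$ is uhc (see Lemma 4),
stability ($\sigma_t \to \sigma$) implies $\sigma \in \Sigma(\mu_\infty)$. We conclude by showing that $\sigma$ is an optimal strategy
for the MDP($\bar{Q}_{\mu_\infty}$). If $\delta = \bar\delta = 0$, this assertion is trivial. If $\bar\delta \geq \delta > 0$, it suffices to
show that

$$x(s, \mu_\infty) = \arg\max_{x \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s,x,s') + \delta W(s', B(s,x,s',\mu_\infty))\}\, \bar{Q}_{\mu_\infty}(ds' \mid s,x)$$

$$= \arg\max_{x \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s,x,s') + \delta V_{\bar{Q}_{\mu_\infty}}(s')\}\, \bar{Q}_{\mu_\infty}(ds' \mid s,x). \tag{21}$$

We conclude by establishing (21). Note that, since $\bar\delta \geq \delta > 0$, it follows that $\Delta > 0$, which
in turn implies that $x(s, \mu_\infty)$ is a singleton. The first equality in (21) holds because,
by definition of $\Delta$,

$$E_{\bar{Q}_{\mu_\infty}(\cdot \mid s, x(s,\mu_\infty))}[\pi(s, x(s,\mu_\infty), S')] - E_{\bar{Q}_{\mu_\infty}(\cdot \mid s,x)}[\pi(s,x,S')] \geq \Delta \geq 2\frac{\delta M}{1-\delta} > 0$$

for all $x \in \Gamma(s) \setminus \{x(s, \mu_\infty)\}$, and, by definition of $M$,

$$2\frac{\delta M}{1-\delta} \geq \delta\Big\{\int_{\mathbb{S}} W(s', B(s,x,s',\mu_\infty))\, \bar{Q}_{\mu_\infty}(ds' \mid s,x) - \int_{\mathbb{S}} W(s', B(s,x(s,\mu_\infty),s',\mu_\infty))\, \bar{Q}_{\mu_\infty}(ds' \mid s,x(s,\mu_\infty))\Big\}.$$

The second equality in (21) holds by similar arguments.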

(ii) By stability with exhaustive learning, there exists a subsequence $(\mu_{t(j)})_j$ such
that $\mu_{t(j)}$ converges weakly to $\mu^*$. This fact and the fact that, for all open $U \supseteq \Theta_Q(m)$,
$\lim_{j\to\infty} \mu_{t(j)}(U) = 1$, imply that $\mu^* \in \Delta(\Theta_Q(m))$. Since $\sigma_{t(j)} \in \Sigma(\mu_{t(j)})$ for all $j$ and $\Sigma$ is uhc (see
Lemma 4), stability ($\sigma_t \to \sigma$) implies $\sigma \in \Sigma(\mu^*)$. Moreover, by the condition of stability
with exhaustive learning (i.e., $\mu^* = B(s,x,s',\mu^*)$ for all $(s,x) \in \mathrm{Gr}(\Gamma)$ and
$s' \in \mathrm{supp}(\bar{Q}_{\mu^*}(\cdot \mid s,x))$), $W(s, \mu^*) = \max_{x\in\Gamma(s)} \int_{\mathbb{S}} \{\pi(s,x,s') + \delta W(s', \mu^*)\}\, \bar{Q}_{\mu^*}(ds' \mid s,x)$ for
all $s \in \mathbb{S}$. Then, by uniqueness of the value function, $\sigma$ is an optimal strategy for the
MDP($\bar{Q}_{\mu^*}$). □

The proof of Proposition 2 relies on the following claim. 

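Claim C. If $(\sigma, m) \in \Sigma \times \Delta(\mathbb{S} \times \mathbb{X})$ is such that $\sigma \in \Sigma^\varepsilon$ and $m \in I_Q(\sigma)$ with $Q$
satisfying the full communication condition in Definition 18, then $m(s,x) > 0$ for all
$(s,x) \in \mathrm{Gr}(\Gamma)$.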

Proof of Proposition 2. (i) We show that, if $(\sigma, m)$ is stable for a Bayesian-
SMDP that is $\varepsilon$-perturbed, weakly identified and satisfies full communication (and
has a prior $\mu_0$ and policy function $f$), then $(\sigma, m)$ is stable with exhaustive learning.
That is, we must find a subsequence $(\mu_{t(j)})_j$ such that $\mu_{t(j)}$ converges weakly to
$\mu^*$ and $\mu^* = B(s,x,s',\mu^*)$ for any $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathrm{supp}(\bar{Q}_{\mu^*}(\cdot \mid s,x))$.
By compactness of $\Delta(\Theta)$, there always exists a convergent subsequence with limit
point $\mu^* \in \Delta(\Theta)$. By Lemma 5, $\mu^* \in \Delta(\Theta_Q(m))$. By assumption, $\sigma \in \Sigma^\varepsilon$ and,
by the arguments given in the proof of Theorem 2, $m \in I_Q(\sigma)$. Since the SMDP
satisfies full communication, by Claim C, $\mathrm{supp}(m) = \mathrm{Gr}(\Gamma)$. This result, the fact
that $\mu^* \in \Delta(\Theta_Q(m))$, and weak identification imply strong identification, i.e., for any
$\theta_1$ and $\theta_2$ in the support of $\mu^*$, $Q_{\theta_1}(\cdot \mid s,x) = Q_{\theta_2}(\cdot \mid s,x)$ for all $(s,x) \in \mathrm{Gr}(\Gamma)$.
Hence, it follows that, for all $A \subseteq \Theta$ Borel and for all $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathbb{S}$ such
that $\bar{Q}_{\mu^*}(s' \mid s,x) > 0$ (i.e., $\int_\Theta Q_\theta(s' \mid s,x)\mu^*(d\theta) > 0$),

$$B(s,x,s',\mu^*)(A) = \frac{\int_A Q_\theta(s' \mid s,x)\mu^*(d\theta)}{\int_\Theta Q_\theta(s' \mid s,x)\mu^*(d\theta)} = \mu^*(A).$$

Thus, $\mu^*$ satisfies the desired condition.

(ii) We prove that if $(\sigma, m)$ is a Berk-Nash equilibrium, then it is also a Berk-Nash
equilibrium with exhaustive learning. Let $\mu$ be the supporting equilibrium belief. By
Claim C and weak identification, it follows that there is strong identification, and so
for any $\theta_1$ and $\theta_2$ in the support of $\mu$, $Q_{\theta_1}(\cdot \mid s,x) = Q_{\theta_2}(\cdot \mid s,x)$ for all $(s,x) \in \mathrm{Gr}(\Gamma)$.
It follows that, for all $A \subseteq \Theta$ Borel and for all $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathbb{S}$ such that
$\bar{Q}_\mu(s' \mid s,x) > 0$ (i.e., $\int_\Theta Q_\theta(s' \mid s,x)\mu(d\theta) > 0$),

$$B(s,x,s',\mu)(A) = \frac{\int_A Q_\theta(s' \mid s,x)\mu(d\theta)}{\int_\Theta Q_\theta(s' \mid s,x)\mu(d\theta)} = \mu(A).$$

Thus, $(\sigma, m)$ is a Berk-Nash equilibrium with exhaustive learning. □

Proof of Proposition 3. Suppose $(\sigma, m)$ is a perfect Berk-Nash equilibrium
and let $(\sigma^\varepsilon, m^\varepsilon, \mu^\varepsilon)_\varepsilon$ be the associated sequence of equilibria with exhaustive learning
such that $\lim_{\varepsilon\to 0}(\sigma^\varepsilon, m^\varepsilon) = (\sigma, m)$. By possibly going to a sub-sequence, let
$\mu = \lim_{\varepsilon\to 0} \mu^\varepsilon$ (under the weak topology). By the upper hemicontinuity of the equilibrium
correspondence $T(\sigma, m, \mu) = \Sigma(\bar{Q}_\mu) \times I_Q(\sigma) \times \Delta(\Theta_Q(m))$ (see the proof of Theorem 1),
$(\sigma, m)$ is a Berk-Nash equilibrium with supporting belief $\mu$. We conclude by showing
that $(\sigma, m)$ is a Berk-Nash equilibrium with exhaustive learning.

For all $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathrm{supp}(\bar{Q}_\mu(\cdot \mid s,x))$, and for all $f : \Theta \to \mathbb{R}$
bounded and continuous,

$$\Big|\int f(\theta)\mu(d\theta) - \int f(\theta)B(s,x,s',\mu)(d\theta)\Big| \leq \Big|\int f(\theta)\mu(d\theta) - \int f(\theta)\mu^\varepsilon(d\theta)\Big| + \Big|\int f(\theta)\mu^\varepsilon(d\theta) - \int f(\theta)B(s,x,s',\mu)(d\theta)\Big|.$$

The first term in the RHS
vanishes as $\varepsilon \to 0$ by definition of weak convergence. For the second term, note
that, for sufficiently small $\varepsilon$, $s' \in \mathrm{supp}(\bar{Q}_{\mu^\varepsilon}(\cdot \mid s,x))$, and so, since $\mu^\varepsilon = B(s,x,s',\mu^\varepsilon)$
for any $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathrm{supp}(\bar{Q}_{\mu^\varepsilon}(\cdot \mid s,x))$, we can replace $\int f(\theta)\mu^\varepsilon(d\theta)$ with
$\int f(\theta)B(s,x,s',\mu^\varepsilon)(d\theta)$. Thus, the second term vanishes by continuity of the Bayesian
operator. Therefore, by a standard argument, 31 $\mu(A) = B(s,x,s',\mu)(A)$ for all $A \subseteq \Theta$
Borel and all $(s,x) \in \mathrm{Gr}(\Gamma)$ and $s' \in \mathrm{supp}(\bar{Q}_\mu(\cdot \mid s,x))$, which implies that $(\sigma, m)$ is a
Berk-Nash equilibrium with exhaustive learning. □

Proof of Theorem 3. Existence of a Berk-Nash equilibrium of an $\varepsilon$-perturbed
environment, $(\sigma^\varepsilon, m^\varepsilon)$, follows for all $\varepsilon \in (0, \bar\varepsilon]$, where $\bar\varepsilon = 1/(|\mathbb{X}| + 1)$, from the same
arguments used to establish existence for the case $\varepsilon = 0$ (see Theorem 1). Weak
identification, full communication and Proposition 2(ii) imply that there exists a sequence
$(\sigma^\varepsilon, m^\varepsilon)_{\varepsilon > 0}$ of Berk-Nash equilibria with exhaustive learning. By compactness of
$\Sigma \times \Delta(\mathrm{Gr}(\Gamma))$, there is a convergent subsequence, whose limit is a perfect Berk-Nash
equilibrium by definition. □


31 Suppose $\mu_1, \mu_2$ in $\Delta(\Theta)$ are such that $|\int f(\theta)\mu_1(d\theta) - \int f(\theta)\mu_2(d\theta)| = 0$ for any $f$ bounded and
continuous. Then, for any $F \subseteq \Theta$ closed, $\mu_1(F) - \mu_2(F) \leq E_{\mu_1}[f_F(\theta)] - \mu_2(F) = E_{\mu_2}[f_F(\theta)] - \mu_2(F)$,
where $f_F$ is any continuous and bounded function with $f_F \geq 1_F$; we call the class of such functions $C_F$. Thus,
$\mu_1(F) - \mu_2(F) \leq \inf_{f \in C_F} E_{\mu_2}[f(\theta)] - \mu_2(F) = 0$, where the equality follows from an application of
the monotone convergence theorem. An analogous trick yields the reverse inequality and, therefore,
$\mu_1(F) = \mu_2(F)$ for any $F \subseteq \Theta$ closed. Borel measures over $\Theta$ are inner regular (also known as
tight; see Aliprantis and Border (2006), Ch. 12, Theorem 12.7). Thus, for any Borel set $A \subseteq \Theta$
and any $\varepsilon > 0$, there exists an $F \subseteq A$ compact such that $\mu_i(A \setminus F) < \varepsilon$ for all $i = 1, 2$. Therefore
$\mu_1(A) - \mu_2(A) \leq \mu_1(A) - \mu_2(F) \leq \mu_1(F) - \mu_2(F) + \varepsilon$. By our previous result, it follows that
$\mu_1(A) - \mu_2(A) \leq \varepsilon$. A similar trick yields the reverse inequality and, since $\varepsilon$ is arbitrary, this implies
that $\mu_1(A) = \mu_2(A)$ for all $A \subseteq \Theta$ Borel.

Online Appendix
Proof of Lemmas 1 and 4. 

The proof of these lemmas is standard. We only prove Lemma 1; the proof of Lemma 
4 is analogous. 

Proof of Lemma 1. (i) Let $T_Q : L^\infty(\mathbb{S}) \to L^\infty(\mathbb{S})$ be the Bellman operator,
$T_Q[W](s) = \max_{x \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s,x,s') + \delta W(s')\}\, Q(ds' \mid s,x)$. By standard arguments,
$T_Q$ is a contraction with modulus $\delta$, and so there is a unique fixed point $V_Q \in L^\infty(\mathbb{S})$.
To establish continuity in $Q$, let $V_{Q_n}$ be a sequence of fixed points given $Q_n$ such that
$Q_n$ converges to $Q$ and let $V_Q$ be the fixed point given $Q$. Then

$$\|V_{Q_n} - V_Q\|_{L^\infty} \leq \|T_{Q_n}[V_{Q_n}] - T_{Q_n}[V_Q]\|_{L^\infty} + \|T_{Q_n}[V_Q] - T_Q[V_Q]\|_{L^\infty}$$

$$\leq \delta\|V_{Q_n} - V_Q\|_{L^\infty} + \|T_{Q_n}[V_Q] - T_Q[V_Q]\|_{L^\infty}$$

and, since $\delta \in [0,1)$, it only remains to show that $\|T_{Q_n}[V_Q] - T_Q[V_Q]\|_{L^\infty} \to 0$. Note
that, for any $s \in \mathbb{S}$,

$$T_{Q_n}[V_Q](s) - T_Q[V_Q](s) \leq \int_{\mathbb{S}} (\pi(s,x_n,s') + \delta V_Q(s'))\{Q_n(ds' \mid s,x_n) - Q(ds' \mid s,x_n)\}$$

where $x_n \in \arg\max_{x \in \Gamma(s)} \int_{\mathbb{S}} \{\pi(s,x,s') + \delta V_Q(s')\}\, Q_n(ds' \mid s,x)$. Since $V_Q$ and $\pi$ are in
$L^\infty(\mathbb{S})$ and $|\mathbb{S}| < \infty$, it follows that $T_{Q_n}[V_Q](s) - T_Q[V_Q](s) \leq C\|Q_n - Q\|$ for some
finite constant $C$. Using similar arguments, one can show that $T_Q[V_Q](s) - T_{Q_n}[V_Q](s) \leq C\|Q_n - Q\|$.
Therefore, $\|T_{Q_n}[V_Q] - T_Q[V_Q]\|_{L^\infty} \leq C\|Q_n - Q\|$ and the desired result
follows because $\|Q_n - Q\| \to 0$.

(ii) For each $s \in \mathbb{S}$ and $Q \in \Delta(\mathbb{S})^{\mathrm{Gr}(\Gamma)}$, let $X_s(Q) \equiv \arg\max_{x \in \Gamma(s)} U_s(x, Q)$, where
$U_s(x, Q) = \int_{\mathbb{S}} \{\pi(s,x,s') + \delta V_Q(s')\}\, Q(ds' \mid s,x)$. Note that
$\Sigma(Q) = \{\sigma \in \Sigma : \forall s \in \mathbb{S},\ \sigma(\cdot \mid s) \in \Delta(X_s(Q))\}$ is isomorphic to $\times_{s\in\mathbb{S}} \Delta(X_s(Q))$, in
the sense that $\sigma \in \Sigma(Q)$ iff $(\sigma(\cdot \mid s_1), \ldots, \sigma(\cdot \mid s_{|\mathbb{S}|})) \in \times_{s\in\mathbb{S}} \Delta(X_s(Q))$. By part (i), $U_s$
is continuous, and so the Theorem of the Maximum implies that $X_s(Q)$ is nonempty,
compact-valued, and upper hemicontinuous in $Q$. By Theorem 17.13 in Aliprantis
and Border (2006), $Q \mapsto \Delta(X_s(Q))$ is also nonempty, compact-valued and upper
hemicontinuous for each $s \in \mathbb{S}$. By Tychonoff's Theorem, so is $\times_{s\in\mathbb{S}} \Delta(X_s(Q))$, and
consequently $\Sigma(Q)$. Finally, to establish convexity of $\Sigma(Q)$, let $\sigma, \sigma' \in \Sigma(Q)$, $\alpha \in (0,1)$
and $\sigma_\alpha = \alpha\sigma + (1-\alpha)\sigma'$. Then, for all $s \in \mathbb{S}$,
$\mathrm{supp}\, \sigma_\alpha(\cdot \mid s) = \mathrm{supp}\, \sigma(\cdot \mid s) \cup \mathrm{supp}\, \sigma'(\cdot \mid s) \subseteq X_s(Q)$, and so $\sigma_\alpha \in \Sigma(Q)$. □


Proof of Claims A, B, and C 

Notation. For the proofs of Claims A and B, let $Z = \mathbb{S} \times Gr(\Gamma)$. For each $z = (s',s,x) \in Z$ and $m \in \Delta(Gr(\Gamma))$, define $P_m(z) = Q(s' \mid s,x)\,m(s,x)$. We sometimes abuse notation and write $Q(z) = Q(s' \mid s,x)$, and similarly for $Q_\theta$.

Proof of Claim A. (i) By regularity and finiteness of $Z$, there exist $\theta^* \in \Theta$ and $\alpha \in (0,1)$ such that $Q_{\theta^*}(z) \ge \alpha$ for all $z \in Z$ such that $Q(z) > 0$. Thus, for all $m \in \Delta(Gr(\Gamma))$, $K_Q(m,\theta^*) \le -E_{P_m}[\ln Q_{\theta^*}(Z)] \le -\ln \alpha$.

(ii) $K_Q(m_n,\theta) - K_Q(m,\theta) = \sum_{z : Q(z) > 0} (P_{m_n}(z) - P_m(z))(\ln Q(z) - \ln Q_\theta(z))$. By the assumption that $Q_\theta(z) > 0$ for all $z$ such that $Q(z) > 0$, $(\ln Q(z) - \ln Q_\theta(z))$ is bounded for all $z$ such that $Q(z) > 0$. In addition, $P_{m_n}(z) - P_m(z)$ converges to zero for all $z \in Z$ because $m \mapsto P_m$ is linear and $m_n$ converges to $m$.

(iii) $K_Q(m_n,\theta_n) - K_Q(m,\theta) = \sum_{z : Q(z) > 0} (P_{m_n}(z) - P_m(z)) \ln Q(z) + \sum_{z : Q(z) > 0} (P_m(z) \ln Q_\theta(z) - P_{m_n}(z) \ln Q_{\theta_n}(z))$. The first term on the RHS converges to zero (same argument as in Claim A(ii)). The proof concludes by showing that, for all $z$,

\[ \liminf_{n \to \infty} \, -P_{m_n}(z) \ln Q_{\theta_n}(z) \ge -P_m(z) \ln Q_\theta(z). \qquad (22) \]

Suppose $\liminf_{n \to \infty} -P_{m_n}(z) \ln Q_{\theta_n}(z) \le M < \infty$ (if not, (22) holds trivially). Then either (i) $P_{m_n}(z) \to P_m(z) > 0$, in which case (22) holds with equality by continuity of $Q_\theta(z)$ in $\theta$, or (ii) $P_{m_n}(z) \to P_m(z) = 0$, in which case (22) holds because its RHS is zero (by the convention that $0 \ln 0 = 0$) and its LHS is always nonnegative. □
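To make the objects in Claim A concrete, here is a small Python sketch that evaluates $K_Q(m,\theta) = \sum_{z : Q(z) > 0} P_m(z) \ln(Q(z)/Q_\theta(z))$ on a finite example, checks the bound $K_Q(m,\theta) \le -\ln \alpha$ from part (i), and evaluates it along a sequence $m_n \to m$ as in part (ii). The transition probabilities, the model `Q_theta`, and the function name `weighted_kl` are assumptions made up for illustration.

```python
import math


def weighted_kl(Q, Q_theta, m):
    """K_Q(m, theta) = sum_z P_m(z) * ln(Q(z) / Q_theta(z)) over z with Q(z) > 0.

    Q and Q_theta map (s, x) to a list of next-state probabilities;
    m maps (s, x) to its weight, so P_m(s', s, x) = Q(s' | s, x) * m(s, x).
    """
    total = 0.0
    for sx, weight in m.items():
        for q, q_theta in zip(Q[sx], Q_theta[sx]):
            if q > 0.0:  # terms with Q(z) = 0 are dropped (convention 0 ln 0 = 0)
                total += weight * q * math.log(q / q_theta)
    return total


# Hypothetical primitives on three points of Gr(Gamma) and two states.
Q = {(0, 0): [0.7, 0.3], (0, 1): [1.0, 0.0], (1, 0): [0.4, 0.6]}
Q_theta = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1], (1, 0): [0.2, 0.8]}

# Smallest model probability on the support of Q plays the role of alpha in part (i).
alpha = min(qt for sx in Q for q, qt in zip(Q[sx], Q_theta[sx]) if q > 0.0)

m = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.3}
# A sequence m_n -> m obtained by mixing m with the uniform distribution.
m_seq = [
    {sx: (1 - 1.0 / n) * w + (1.0 / n) / len(m) for sx, w in m.items()}
    for n in (10, 100, 1000)
]

K_limit = weighted_kl(Q, Q_theta, m)
K_seq = [weighted_kl(Q, Q_theta, m_n) for m_n in m_seq]
```

Because `weighted_kl` is linear in `m` for a fixed model, the values in `K_seq` approach `K_limit` at the same rate at which `m_n` approaches `m`.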


Proof of Claim B. (i) For any $z \in Z$ and any $h^\infty \in \mathbb{H}$, let $\mathrm{freq}_t(h^\infty)(z) = t^{-1} \sum_{\tau=1}^{t} 1_{\{z\}}(z_\tau)$. Observe that

\[ t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) = \kappa_{1t}(h^\infty) + \kappa_2 - \kappa_{3t}(h^\infty,\theta), \]

where $\kappa_{1t}(h^\infty) = \sum_{z \in Z} (\mathrm{freq}_t(h^\infty)(z) - P_m(z)) \ln Q(z)$, $\kappa_2 = \sum_{z : Q(z) > 0} P_m(z) \ln Q(z)$, and $\kappa_{3t}(h^\infty,\theta) = \sum_{z \in Z} \mathrm{freq}_t(h^\infty)(z) \ln Q_\theta(z)$.

We first show that $\lim_{t \to \infty} \kappa_{1t}(h^\infty) = 0$ a.s.-$P^f$. To do this, let $g_\tau(h^\infty,z) = (1_{\{z\}}(z_\tau) - P_m(z)) \ln Q(z)$, and observe that $(g_\tau(\cdot,z))_\tau$ is a martingale difference sequence. Let $h^t$ denote the partial history until time $t$ and $L_t(h^\infty,z) = \sum_{\tau=1}^{t} \tau^{-1} g_\tau(h^\infty,z)$; note that $E_{P^f(\cdot \mid h^t)}[L_{t+1}(h^\infty,z)] = L_t(h^\infty,z)$, and so $(L_t(\cdot,z))_t$ is a martingale with respect to $P^f$. Moreover, $E_{P^f(\cdot \mid h^t)}[|g_t(h^\infty,z)|^2] \le (\ln Q(z))^2 Q(z)$, which is bounded by 1; this result, the law of iterated expectations, and the fact that $(g_t(\cdot,z))_t$ are uncorrelated imply that $\sup_t E_{P^f}[|L_t(h^\infty,z)|^2] \le M$ for some $M < \infty$. Hence, by the Martingale Convergence Theorem (see Theorem 5.2.8 in Durrett (2010)), $L_t(h^\infty,z)$ converges a.s.-$P^f$ to a finite $L_\infty(h^\infty,z)$. By Kronecker's lemma (Pollard (2001), page 105)^{32}, $\lim_{t \to \infty} t^{-1} \sum_{\tau=1}^{t} g_\tau(h^\infty,z) = 0$ a.s.-$P^f$, for all (uniformly) $z \in Z$. Thus, $\lim_{t \to \infty} \kappa_{1t}(h^\infty) = 0$ a.s.-$P^f$.

We also note that analogous arguments show that

\[ \lim_{t \to \infty} \mathrm{freq}_t(h^\infty)(z) = P_m(z) \]

a.s.-$P^f$, for all (uniformly) $z \in Z$.

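Since $\theta \in \Theta$, $z \mapsto -\log(Q_\theta(z))$ is bounded. Thus, by arguments analogous to those used to show $\lim_{t \to \infty} \kappa_{1t}(h^\infty) = 0$ a.s.-$P^f$, it follows that, for any $\theta \in \Theta$, $\lim_{t \to \infty} \kappa_{3t}(h^\infty,\theta) = \sum_{z \in Z} P_m(z) \ln Q_\theta(z)$ a.s.-$P^f$. This result and the fact that $\lim_{t \to \infty} \kappa_{1t}(h^\infty) = 0$ a.s.-$P^f$ imply that

\[ \lim_{t \to \infty} t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) = \sum_{z \in Z} P_m(z) \log\left( \frac{Q(z)}{Q_\theta(z)} \right) = \sum_{(s,x) \in Gr(\Gamma)} E_{Q(\cdot \mid s,x)}\left[ \log\left( \frac{Q(S' \mid s,x)}{Q_\theta(S' \mid s,x)} \right) \right] m(s,x) \]

for any $\theta \in \Theta$ a.s.-$P^f$, as desired.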
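The almost-sure limit in part (i) can be illustrated by simulation. The Python sketch below simulates a two-state chain, averages the log-likelihood ratio along the path, and compares the average with $K_Q(m,\theta)$ evaluated at the stationary distribution $m$. It assumes, purely for illustration, a single feasible action in each state (so the action is suppressed), made-up transition probabilities, and a fixed seed; none of these come from the paper.

```python
import math
import random

# Hypothetical two-state chain with one feasible action per state.
Q = {0: [0.7, 0.3], 1: [0.4, 0.6]}        # true transitions Q(s' | s)
Q_theta = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # a misspecified model Q_theta(s' | s)

# Stationary distribution of a two-state chain: m(0) = Q(0|1) / (Q(1|0) + Q(0|1)).
m0 = Q[1][0] / (Q[0][1] + Q[1][0])
m = {0: m0, 1: 1.0 - m0}

# Weighted Kullback-Leibler divergence K_Q(m, theta).
K = sum(
    m[s] * Q[s][s2] * math.log(Q[s][s2] / Q_theta[s][s2])
    for s in Q
    for s2 in (0, 1)
)

rng = random.Random(0)  # fixed seed so the run is reproducible
T = 100_000
s, log_ratio_sum = 0, 0.0
for _ in range(T):
    s_next = 0 if rng.random() < Q[s][0] else 1  # draw s' from Q(. | s)
    log_ratio_sum += math.log(Q[s][s_next] / Q_theta[s][s_next])
    s = s_next

# Time average of log(Q / Q_theta) along the simulated history.
average_log_ratio = log_ratio_sum / T
```

With this many periods the time average should sit close to `K`; the gap reflects sampling error only and should shrink as `T` grows.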

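(ii) For any $\xi > 0$, define $\Theta_{m,\xi}$ to be the set such that $\theta \in \Theta_{m,\xi}$ if and only if $Q_\theta(z) \ge \xi$ for all $z$ such that $P_m(z) > 0$. For any $\xi > 0$, let $\zeta_\xi = -\alpha/(\#Z\, 4 \ln \xi) > 0$. By the fact that $\lim_{t \to \infty} \mathrm{freq}_t(h^\infty)(z) = P_m(z)$ a.s.-$P^f$, for all (uniformly) $z \in Z$, there exists $t_{\zeta_\xi}$ such that, $\forall t \ge t_{\zeta_\xi}$,

\[ \kappa_{3t}(h^\infty,\theta) \le \sum_{\{z : P_m(z) > 0\}} \mathrm{freq}_t(h^\infty)(z) \ln Q_\theta(z) \le \sum_{\{z : P_m(z) > 0\}} (P_m(z) - \zeta_\xi) \ln Q_\theta(z) \le \sum_{(s,x) \in Gr(\Gamma)} E_{Q(\cdot \mid s,x)}[\ln Q_\theta(S' \mid s,x)]\, m(s,x) - \#Z\, \zeta_\xi \ln \xi, \]

a.s.-$P^f$ and $\forall \theta \in \Theta_{m,\xi}$ (since $Q_\theta(z) \ge \xi$ $\forall z$ such that $P_m(z) > 0$). The above expression, the fact that $\alpha/4 = -\#Z\, \zeta_\xi \ln \xi$, and the fact that $t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) = \kappa_{1t}(h^\infty) + \kappa_2 - \kappa_{3t}(h^\infty,\theta)$ imply that, $\forall t \ge t_{\zeta_\xi}$,

\[ t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) \ge \sum_{(s,x) \in Gr(\Gamma)} E_{Q(\cdot \mid s,x)}\left[ \ln \frac{Q(S' \mid s,x)}{Q_\theta(S' \mid s,x)} \right] m(s,x) - \frac{\alpha}{4} = K_Q(m,\theta) - \frac{1}{4}\alpha, \qquad (23) \]

a.s.-$P^f$ and $\forall \theta \in \Theta_{m,\xi}$. For any $\theta \in \{\Theta : d_m(\theta) \ge \varepsilon\} \cap \Theta_{m,\xi}$, the RHS is bounded below by $K^*(m) + 3\alpha - \frac{1}{4}\alpha > K^*(m) + \frac{3}{2}\alpha$.

32 This lemma implies that, for a sequence $(\ell_t)_t$, if $\sum_\tau \ell_\tau < \infty$, then $\sum_{\tau=1}^{t} \frac{b_\tau}{b_t} \ell_\tau \to 0$, where $(b_t)_t$ is a nondecreasing positive real-valued sequence that diverges to $\infty$. We can apply the lemma with $\ell_t = t^{-1} g_t$ and $b_t = t$.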

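Moreover, since $\lim_{t \to \infty} \mathrm{freq}_t(h^\infty)(z) = P_m(z)$ a.s.-$P^f$ (uniformly over $z \in Z$), there exists a $T(\xi)$ such that, for any $\theta \notin \Theta_{m,\xi}$, $\kappa_{3t}(h^\infty,\theta) \le \mathrm{freq}_t(h^\infty)(z) \ln Q_\theta(z) \le (p_L/2) \ln \xi$ for all $t \ge T(\xi)$ and some $z \in Z$, where $p_L = \min_z \{P_m(z) : P_m(z) > 0\}$. Therefore, for any $\theta \notin \Theta_{m,\xi}$ and a.s.-$P^f$:

\[ t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) \ge \sum_{z \in Z : Q(z) > 0} P_m(z) \ln Q(z) - (p_L/2) \ln \xi \qquad (24) \]

for all $t \ge T(\xi)$. Observe that $\sum_{z \in Z : Q(z) > 0} P_m(z) \ln Q(z)$ and $K^*(m)$ are bounded, so there exists a $\xi(\alpha)$ such that the RHS can be made larger than $K^*(m) + \frac{3}{2}\alpha$.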

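Therefore, by displays (23) and (24), it follows that, for any $t \ge T \equiv \max\{t_{\zeta_{\xi(\alpha)}}, T(\xi(\alpha))\}$ and a.s.-$P^f$,

\[ t^{-1} \sum_{\tau=1}^{t} \log\left( \frac{Q(s_\tau \mid s_{\tau-1}, x_{\tau-1})}{Q_\theta(s_\tau \mid s_{\tau-1}, x_{\tau-1})} \right) \ge K^*(m) + \frac{3}{2}\alpha \]

for all $\theta \in \{\Theta : d_m(\theta) \ge \varepsilon\}$, as desired. □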


Proof of Claim C. We first show that, for any $(s',x') \in Gr(\Gamma)$ and $(s_0,x_0) \in Gr(\Gamma)$, there exists an $n$ such that $M^n_{\sigma,Q}(s',x' \mid s_0,x_0) > 0$, where $M^n_{\sigma,Q} = M_{\sigma,Q} \cdots M_{\sigma,Q}$.^{33} By the condition in Definition 18, there exist an $n$ and a "path" $((s_1,x_1), \ldots, (s_n,x_n))$ such that $(s_i,x_i) \in Gr(\Gamma)$ for all $i = 1, \ldots, n$ and

\[ Q(s' \mid s_n,x_n)\, Q(s_n \mid s_{n-1},x_{n-1}) \cdots Q(s_1 \mid s_0,x_0) > 0. \]

This inequality and the fact that $\sigma(x \mid s) \ge \xi$ for all $(s,x) \in Gr(\Gamma)$ imply that

\[ M^n_{\sigma,Q}(s',x' \mid s_0,x_0) = \sum_{((s_1,x_1),\ldots,(s_n,x_n))} \sigma(x' \mid s')\, Q(s' \mid s_n,x_n) \cdots \sigma(x_1 \mid s_1)\, Q(s_1 \mid s_0,x_0) \ge \xi^{n+1} \sum_{((s_1,x_1),\ldots,(s_n,x_n))} Q(s' \mid s_n,x_n) \cdots Q(s_1 \mid s_0,x_0) > 0, \]

as desired.

Consider any invariant distribution $m$. There exists at least one point $(s_0,x_0) \in Gr(\Gamma)$ such that $m(s_0,x_0) > 0$. For any $(s',x') \in Gr(\Gamma)$, let $n$ be the integer that ensures that $M^n_{\sigma,Q}(s',x' \mid s_0,x_0) > 0$. Then it follows that $m(s',x') = \sum_{(s,x) \in Gr(\Gamma)} M^n_{\sigma,Q}(s',x' \mid s,x)\, m(s,x) \ge M^n_{\sigma,Q}(s',x' \mid s_0,x_0)\, m(s_0,x_0) > 0$. Thus, $\mathrm{supp}(m) = Gr(\Gamma)$. □

33 The expression $M_{\sigma,Q} \cdot M_{\sigma,Q}$ is defined as a transition probability function over $\mathbb{S} \times \mathbb{X}$, where $M_{\sigma,Q} \cdot M_{\sigma,Q}(s',x' \mid s,x) = \sum_{(a,b)} M_{\sigma,Q}(s',x' \mid a,b)\, M_{\sigma,Q}(a,b \mid s,x)$. The expression $M_{\sigma,Q} \cdots M_{\sigma,Q}$ is constructed by successive iterations of the previous one.
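The argument in Claim C can be checked on a small example. The Python sketch below builds $M_{\sigma,Q}(s',x' \mid s,x) = \sigma(x' \mid s')\, Q(s' \mid s,x)$ for a hypothetical two-state, two-action problem in which the one-step kernel has zero entries but the two-step kernel is strictly positive, and then approximates the invariant distribution by iterating the kernel. The primitives, the value $\xi = 0.1$, and the helper names are assumptions chosen only for illustration.

```python
from itertools import product

states, actions = (0, 1), (0, 1)
points = list(product(states, actions))  # Gr(Gamma): every action is feasible in every state

# Hypothetical primitives: action 0 keeps the state, action 1 switches it.
Q = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.0, 1.0], (1, 1): [1.0, 0.0]}
sigma = {0: [0.9, 0.1], 1: [0.9, 0.1]}  # sigma(x | s) >= xi = 0.1 for every (s, x)


def kernel(target, source):
    """M_{sigma,Q}(s', x' | s, x) = sigma(x' | s') * Q(s' | s, x)."""
    (s2, x2), (s, x) = target, source
    return sigma[s2][x2] * Q[(s, x)][s2]


def compose(A, B):
    """(A . B)(z' | z) = sum over intermediate points a of A(z' | a) * B(a | z)."""
    return {
        (t, src): sum(A[(t, a)] * B[(a, src)] for a in points)
        for t in points
        for src in points
    }


M1 = {(t, src): kernel(t, src) for t in points for src in points}
M2 = compose(M1, M1)  # two-step kernel

# Approximate the invariant distribution by repeatedly applying M1 to a point mass.
m = {z: (1.0 if z == (0, 0) else 0.0) for z in points}
for _ in range(500):
    m = {t: sum(M1[(t, src)] * m[src] for src in points) for t in points}
```

In this example some entries of `M1` are zero, every entry of `M2` is positive, and the limiting `m` puts positive mass on all four points of the graph, in line with $\mathrm{supp}(m) = Gr(\Gamma)$.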


Computing $\Theta_Q(\cdot)$ and the stationary distribution in the search example

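Claim D. (i) Let $\sigma$ be a strategy characterized by a threshold $w^*$. Then there is a unique stationary marginal distribution over $\mathbb{X}$, $m_X(\cdot\,; w^*)$, and it is given by

\[ m_X(0; w^*) = \frac{E[\gamma] - (1 - F(w^*))\, E[\lambda\gamma]}{(1 - F(w^*))\{E[\lambda] - E[\lambda\gamma]\} + E[\gamma]}. \]

(ii) For any $m \in \Delta(Gr(\Gamma))$ with marginal $m_X \in \Delta(\mathbb{X})$, $\Theta_Q(m)$ is a singleton given by

\[ \theta_Q(m) = \frac{m_X(0)}{m_X(0) + m_X(1) E[\gamma]}\, E[\lambda] + \left( 1 - \frac{m_X(0)}{m_X(0) + m_X(1) E[\gamma]} \right) \left( E[\lambda] + \frac{Cov(\gamma,\lambda)}{E[\gamma]} \right). \]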


Proof of Claim D. (i) For any $m \in \Delta(Gr(\Gamma))$, $z'$, $x'$, and $A \subseteq \mathbb{S}$ Borel, let

\[ m(z', A, x') = \int\!\!\int \sigma(x' \mid w')\, Q(z', A \mid w, x)\, m(w,x)\, dw\, dx, \]

where $z', A, x'$ is just notation for the set $\{z'\} \times A \times \{x'\}$ and $Q(z', A \mid w, x) = \Pr(A \mid z', w, x)\, G(z')$, with

\[ \Pr(w' \in A \mid z', w, 0) = \begin{cases} \int_A F(dw') & \text{w/ prob. } \lambda(z') \\ 1\{0 \in A\} & \text{w/ prob. } (1 - \lambda(z')) \end{cases} \]

and

\[ \Pr(w' \in A \mid z', w, 1) = \begin{cases} \begin{cases} \int_A F(dw') & \text{w/ pr. } \lambda(z') \\ 1\{0 \in A\} & \text{w/ pr. } (1 - \lambda(z')) \end{cases} & \text{w/ pr. } \gamma(z') \\ 1\{w \in A\} & \text{w/ pr. } 1 - \gamma(z'). \end{cases} \]

Also, $\sigma(1 \mid w) = 1\{w > w^*\}$. Hence, for $x' = 1$,

\[ m(z', \mathbb{S}, 1; w^*) = m(z', \{w' > w^*\}, 1; w^*), \]

and similarly for $x' = 0$. Thus, $m_X(1; w^*) = \int_Z m(dz', \{w' > w^*\}, 1; w^*)$ and $m_X(0; w^*) = \int_Z m(dz', \{w' < w^*\}, 0; w^*)$ ($w' = w^*$ occurs with probability zero, so it can be ignored). It thus follows that

\[
\begin{aligned}
m_X(1; w^*) &= \int\!\!\int\!\!\int \sigma(x' \mid w')\, Q(dz', \{w' > w^*\} \mid w, x)\, m(w,x; w^*)\, dw\, dx \\
&= \int\!\!\int\!\!\int \Pr(\{w' > w^*\} \mid z', w, x)\, G(dz')\, m(w,x; w^*)\, dw\, dx \\
&= \int\!\!\int \Pr(\{w' > w^*\} \mid z', w, 0)\, G(dz')\, m(w,0; w^*)\, dw + \int\!\!\int \Pr(\{w' > w^*\} \mid z', w, 1)\, G(dz')\, m(w,1; w^*)\, dw \\
&= \int \lambda(z')\, G(dz')\, (1 - F(w^*))\, m_X(0; w^*) + \int \gamma(z') \lambda(z') (1 - F(w^*))\, G(dz')\, m_X(1; w^*) \\
&\quad + \int (1 - \gamma(z'))\, G(dz') \int 1\{w > w^*\}\, m(dw, 1; w^*),
\end{aligned}
\]

where the last line follows from the fact that $1\{w' > w^*\}\, 1\{w' = 0\} = 0$ always. Observe that $\int_W 1\{w > w^*\}\, m(dw, 1; w^*) = m(\{w > w^*\}, 1; w^*) = m_X(1; w^*)$ by our previous observation. Thus

\[ m_X(1; w^*) = E[\lambda](1 - F(w^*))\, m_X(0; w^*) + \{E[\lambda\gamma](1 - F(w^*)) + (1 - E[\gamma])\}\, m_X(1; w^*). \]

Solving for $m_X(1; w^*)$, we obtain

\[ m_X(1; w^*) = \frac{E[\lambda](1 - F(w^*))}{(1 - F(w^*))\{E[\lambda] - E[\lambda\gamma]\} + E[\gamma]}. \]

The result follows by noting that $m_X(0; w^*) = 1 - m_X(1; w^*)$.
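As a quick arithmetic check of part (i), the Python sketch below plugs hypothetical numbers into the closed forms for $m_X(1; w^*)$ and $m_X(0; w^*)$ and verifies that they satisfy the stationarity equation and sum to one. The two-point distribution $G$, the values of $\lambda(z)$ and $\gamma(z)$, and the value of $F(w^*)$ are assumptions invented for the example.

```python
# Hypothetical primitives: z takes two equally likely values under G.
G = [0.5, 0.5]
lam = [0.2, 0.6]    # lambda(z) at each value of z
gamma = [0.1, 0.3]  # gamma(z) at each value of z
F_wstar = 0.5       # F(w*): probability that a draw from F falls below the threshold

E_lam = sum(g * l for g, l in zip(G, lam))
E_gam = sum(g * c for g, c in zip(G, gamma))
E_lam_gam = sum(g * l * c for g, l, c in zip(G, lam, gamma))

above = 1.0 - F_wstar  # probability that a draw from F exceeds w*

# Closed forms from Claim D(i).
denominator = above * (E_lam - E_lam_gam) + E_gam
m1 = E_lam * above / denominator
m0 = (E_gam - above * E_lam_gam) / denominator

# Right-hand side of the stationarity equation for m_X(1; w*).
rhs = E_lam * above * m0 + (E_lam_gam * above + (1.0 - E_gam)) * m1
```

Here `rhs` reproduces `m1`, and `m0 + m1` equals one, so the two closed forms are mutually consistent for these inputs.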

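(ii) For $x = 1$,

\[
\begin{aligned}
E_{Q(\cdot \mid w,1)}[\ln Q_\theta(W' \mid w,1)] &= \sum_{z' \in Z} \gamma(z') \Big\{ \lambda(z') \int \log(\theta)\, F(dw') + (1 - \lambda(z')) \log(1 - \theta) \Big\}\, G(z') + \sum_{z' \in Z} (1 - \gamma(z')) \{\log(1)\}\, G(z') \\
&= E[\lambda\gamma] \log(\theta) + (E[\gamma] - E[\lambda\gamma]) \log(1 - \theta) + Const.
\end{aligned}
\]

Similarly, for $x = 0$,

\[ E_{Q(\cdot \mid w,0)}[\ln Q_\theta(W' \mid w,0)] = E[\lambda] \log(\theta) + (1 - E[\lambda]) \log(1 - \theta) + Const', \]

where $Const$ and $Const'$ are constants that do not depend on $\theta$. It is easy to see that, over $[0,1]$, these are strictly concave functions of $\theta$, so a convex combination also is. The weighted Kullback-Leibler divergence equals a constant minus this combination, so it is strictly convex in $\theta$ and has a unique minimizer. Thus $\Theta_Q(m)$ is a singleton for any $m$, which we denote as $\theta_Q(m)$. The first-order conditions yield

\[ \frac{1 - \theta}{\theta} \{E[\lambda\gamma]\, m_X(1) + E[\lambda]\, m_X(0)\} = \{(E[\gamma] - E[\lambda\gamma])\, m_X(1) + (1 - E[\lambda])\, m_X(0)\}. \]

Thus

\[ \theta_Q(m) = \frac{E[\lambda\gamma]\, m_X(1) + E[\lambda]\, m_X(0)}{E[\gamma]\, m_X(1) + m_X(0)}. \]

The desired result follows from some algebra and from the standard expression for the covariance, $E[\lambda\gamma] = E[\lambda] E[\gamma] + Cov(\gamma,\lambda)$. □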
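The algebra behind part (ii) can be checked numerically. The Python sketch below computes the closed form implied by the first-order condition, compares it with the weighted-average form that involves $Cov(\gamma,\lambda)$, and confirms by a crude grid search that it maximizes the $m_X$-weighted expected log-likelihood (equivalently, minimizes the weighted Kullback-Leibler divergence). The moments and the marginal $m_X$ are hypothetical numbers, and the helper `objective` is an illustrative name rather than notation from the paper.

```python
import math

# Hypothetical moments of lambda and gamma, and a hypothetical marginal m_X.
E_lam, E_gam, E_lam_gam = 0.4, 0.2, 0.1
m0, m1 = 0.3, 0.7
cov = E_lam_gam - E_lam * E_gam  # Cov(gamma, lambda)


def objective(theta):
    """m_X-weighted expected log-likelihood, dropping terms that do not depend on theta."""
    x1_term = E_lam_gam * math.log(theta) + (E_gam - E_lam_gam) * math.log(1 - theta)
    x0_term = E_lam * math.log(theta) + (1 - E_lam) * math.log(1 - theta)
    return m1 * x1_term + m0 * x0_term


# Closed form implied by the first-order condition.
theta_foc = (E_lam_gam * m1 + E_lam * m0) / (E_gam * m1 + m0)

# Equivalent weighted-average form with the covariance term.
weight0 = m0 / (m0 + m1 * E_gam)
theta_cov = weight0 * E_lam + (1 - weight0) * (E_lam + cov / E_gam)

# Crude grid search over (0, 1) as an independent check on the maximizer.
grid = [k / 10000 for k in range(1, 10000)]
theta_grid = max(grid, key=objective)
```

The two closed forms coincide, and the grid maximizer agrees with them up to the grid spacing; a finer grid would tighten the match.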